home *** CD-ROM | disk | FTP | other *** search
Text File | 1995-03-10 | 160.2 KB | 5,767 lines | [TEXT/MPS ] |
-
-
-
-
-
-
-
- PCCTS Reference Manual
-
-
- Version 1.00
-
-
- T. J. Parr, H. G. Dietz, and W. E. Cohen
-
- School of Electrical Engineering
- Purdue University
- West Lafayette, IN 47907
- August 1991
- pccts@ecn.purdue.edu
-
- [Note: please see release notes for all versions since 1.00]
- [Note: The Postscript version is much more readable]
-
-
-
- This document describes the structure and use of the
- Purdue Compiler-Construction Tool Set (PCCTS), a set of pub-
- lic domain software tools designed to facilitate the imple-
- mentation of compilers and other translation systems.
-
- Distribution of PCCTS is managed by an email server
- that reads the Subject: line of each email sent to
- pccts@ecn.purdue.edu. A missing or undeci-
- pherable Subject: line will cause the system to reply with
- email explaining how to use the server. Full C source code
- for PCCTS, various examples, and on-line documentation are
- available.
-
- PCCTS was developed within the School of Electrical
- Engineering at Purdue University, supported in part by NSF
- award number 8624385A3-CDR. As a public domain software
- release, Purdue University and the authors provide the sys-
- tem strictly on an "as-is" basis, without warranty or lia-
- bility. Despite this, bug reports and fixes are welcome.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 1. Introduction
-
- This document describes structure and use of the Purdue
- Compiler-Construction Tool Set (PCCTS). In many ways, PCCTS is simi-
- lar to a highly integrated version of YACC [Joh78] and LEX [Les75];
- where ANTLR (ANother Tool for Language Recognition) corresponds to
- YACC and DLG (DFA-based Lexical analyzer Generator) functions like
- LEX. However, PCCTS has many additional features which make it easier
- to use for a wide range of translation problems.
-
- PCCTS grammars contain specifications for lexical and syntactic
- analysis, intermediate-form construction and error reporting. Future
- releases will include an integrated code-generator generator to sup-
- port code-generation and other intermediate-form translations. Rules
- may employ Extended BNF (EBNF) grammar constructs and may define
- parameters, return values and local variables. Languages described in
- PCCTS are recognized via Strong LL(k) parsers constructed in pure,
- human-readable, C code.
-
- The Version 1.00 Release PCCTS package is considered public-
- domain and includes the following items:
-
-
-
- o C source code for antlr - parser generator.
-
- o C source code for dlg - DFA-based lexical analyzer generator.
-
- o C source for a symbol table manager with arbitrary scoping capa-
- bilities.
-
- o Grammars for ISO PASCAL and ANSI C with symbol table management.
-
- o C source for rexpr(char *expr, char *s) which answers whether s
- is in the language (regular expression) described by expr.
-
- o On-line documents - PCCTS Version 1.00 Reference Manual
-
-
- PCCTS was developed within the School of Electrical Engineering
- at Purdue University to aid in constructing various prototype and spe-
- cialized compilers. After using the system for over a year, we
- decided that it was mature enough to warrant release to a few test
- sites. Hence, in Spring 1990, an experimental version of the system,
- known as PCCTS B, was made available via anonymous ftp. Although the
- system was only announced via two short postings in network news-
- groups, by January 1991, there were over 450 known user sites world-
- wide - and we were hopelessly behind in responding to requests for
- hardcopy manuals.
-
- Now, only a year older but much wiser, we are finally ready for a
- full release of PCCTS - Version 1.00. Experience has proven that we
- are not able to deal with high-quantity hardcopy requests; instead,
- the basic documentation (this document) is being printed by SIGPLAN
-
-
-
- Page 1
-
- PCCTS 1.00
-
-
- Notices and all documentation is available on-line.
-
- The rest of this section describes the overall characteristics of
- PCCTS including its programming interface, translation to C code and
- input file format. Also, an example is presented to illustrate the
- development of parsers using PCCTS.
-
- 1.1. PCCTS programming interface
-
- PCCTS accepts language descriptions in the form of a set of gram-
- matical rules and token/lexeme definitions. Significant emphasis was
- placed on making the PCCTS programming interface simple and flexible.
- Much of the notation was derived from previous language tools and/or
- programming languages in order to bring a sense of familiarity and to
- attenuate the learning curve. For example, attribute notation ($i)
- and grammar syntax were derived mainly from YACC, though they have
- been augmented for use in the EBNF environment. Rules may define
- return types, parameter lists and local variables just as in a pro-
- gramming language. A mechanism for describing abstract-syntax-trees
- is also provided.
-
- PCCTS Grammars tend to be smaller and more readable than grammars
- written for other parser-generators because of the EBNF notation and
- because precedence rules are defined inherently within the grammar
- rather than delineated via pseudo-instructions (e.g. %left, %right in
- YACC). In addition, PCCTS translators (parser plus actions) are gen-
- erally easier to develop because of the extensive programming support.
-
- For example, init-actions allow the user to define local stack-
- based variables or execute a piece of C initialization code. Often, a
- distinct copy of a variable is required for each invocation of a rule.
- This requires the overhead of a software stack when using other
- language tools. PCCTS constructs a function for each rule which
- allows each rule invocation to automatically create its own copy of
- the user variables on the hardware stack.
-
- PCCTS parsers are constructed in pure, human-readable, C code.
- As a result, standard debugging tools can be used to trace and debug
- PCCTS parsers. Breakpoints can be set so that parser execution stops
- before or after certain grammar fragments of interest have been recog-
- nized. PCCTS parsers consist of if, and while statements which can be
- easily traced by a human observer.
-
- PCCTS supports intermediate-form (such as expression-trees) con-
- struction via a flexible abstract-syntax-tree (AST) mechanism which
- allows trees to be built explicitly or automatically. The user expli-
- citly creates trees via a LISP-like tree constructor or directs the
- automatic tree construction facility via simple grammar directives.
- AST tree nodes are user-defined and are generally a function of attri-
- butes or terminals. A default transformation from
- attributes/terminals ($-variables) to AST nodes can be specified.
- Alternatively, each tree node can be defined explicitly via an AST
- node constructor.
-
-
-
-
- Page 2
-
- PCCTS 1.00
-
-
- 1.2. PCCTS information flow
-
- PCCTS currently consists of ANTLR (ANother Tool for Language
- Recognition) and DLG (DFA-based lexical analyzer generator) which con-
- struct parsers and lexical analyzers respectively.
-
- ANTLR accepts as input a complete language description including
- rules for both lexical and syntactic analysis and produces a C program
- that recognizes sentences in that language. A C file is constructed
- for each grammar description file. Also, three additional files -
- err.c, tokens.h and parser.dlg (all of which can be renamed) are gen-
- erated as support files. err.c contains error information and token
- class definitions. tokens.h is a list of #define's representing all
- of the token names defined or referenced in the grammar description
- files as well as function prototypes for rules that have return types.
- parser.dlg is a specification of the lexical analyzer generator
- derived from the user's language description and is translated to C by
- DLG.
-
- DLG creates an automaton that breaks up the input character
- stream into tokens used by the ANTLR-generated parser and it creates
- mode.h which is a list of #define's corresponding to the separate
- automatons defined by your grammar.
-
- The following is a graphical representation of the information
- flow during translation from parser description to executable parser.
-
-
- [Figure omitted]
-
-
-
-
-
- _______________________________________________________________________
- _______________________________________________________________________|
- |f1 ... fn ||Input ANTLR grammar description files |
- |______________||______________________________________________________|
- |______________||______________________________________________________|
- |tokens.h ||List of token names found in f1 ... fn |
- |______________||______________________________________________________|
- |______________||______________________________________________________|
- |parser.dlg ||Scanner description prepared for DLG |
- |______________||______________________________________________________|
- |______________||______________________________________________________|
- |mode.h ||Definitions for lexical modes (#lexclass's) |
- |______________||______________________________________________________|
- |______________||______________________________________________________|
-
-
-
- 1.3. PCCTS file(s) format
-
- PCCTS descriptions adhere to a fairly flexible format and can be
-
-
-
- Page 3
-
- PCCTS 1.00
-
-
- spread out over many files. The actual grammar for PCCTS input can be
- defined in itself as:
-
-
-
-
- grammar : { "#header" Action }
- ( Action | tokenDef | eclassDef )*
- ( rule | tokenDef | eclassDef )+
- ( Action | tokenDef | eclassDef )*
- Eof
- ;
-
-
-
- where Action is a user action (C code), tokenDef is a #token defini-
- tion and eclassDef is an #errclass definition. Lexical actions found
- in a language description are passed on to the parser.dlg file. The (
- )* and ( )+ EBNF operators indicate 0 or more and 1 or more respec-
- tively. { } implies that the items within are optional. Rule grammar
- indicates that a PCCTS file has a #header action first (if at all)
- followed by actions, token definitions and error class definitions
- with the except that actions may not appear between rules. Also, at
- least one rule must be defined. There is no start symbol specifica-
- tion since any rule can be invoked first.
-
- The minimum required user C code is a main program that initial-
- izes the PCCTS-generated parser and calls the starting rule. The lex-
- ical analyzer is derived automatically (from the user's language
- description) by ANTLR and a user routine is not necessary to supply
- tokens to the parser. The main program does not need to be located in
- a PCCTS grammar file.
-
- ANTLR descriptions may be broken up into many different files,
- but the sequence mentioned above in grammar must be maintained. For
- example, if file f1.g contained
-
-
-
-
- #header <<#include "int.h">>
-
- << main() { ANTLR(start(), stdin); } >>
-
-
-
- and file f2.g contained
-
-
-
-
- start : "begin" VAR "=" NUM ";" "end" "." "@" ;
-
-
-
-
-
- Page 4
-
- PCCTS 1.00
-
-
- and file f3.g contained
-
-
-
-
- #token VAR "[a-z]+"
- #token NUM "[0-9]+"
-
-
-
- the correct ANTLR invocation would be
-
-
-
-
- antlr f1.g f2.g f3.g
-
-
-
- Note that the order of files f2.g and f3.g could be switched because a
- rule is either a #token or an actual rule definition. In this case,
- to comply with ANTLR's grammar, the only restriction is that file f1.g
- must be mentioned first on the command line.
-
- Other files may be included into the parser files generated by
- ANTLR via actions containing a #include directive. For example,
-
-
-
-
- <<#include "support_code.h">>
-
-
-
- If a file must be included in all parser files generated by ANTLR, the
- #include directive must be placed in the #header action. In other
- words,
-
-
-
-
- #header <<#include "necessary_type_defs_for_all_files.h">>
-
-
-
- Note that #include's can be used to define any ANTLR object (Attrib,
- AST, etc...) by placing it in the #header action.
-
- 1.4. Makefile template
-
- The template presented in this section can be used to construct
- makefiles for single-file grammars. The order of execution will fol-
- low that illustrated in section 1.2.
-
-
-
-
- Page 5
-
- PCCTS 1.00
-
-
-
-
-
-
- #
- # Makefile for a single-file grammar
- #
- ANTLR_H= path to antlr.h, dlgdef.h, dlgauto.h, standard attribute defs
- CFLAGS= -I$(ANTLR_H) -DLL_K = $(K)
- AFLAGS=
- GRM=user_grammar_file_name
- SRC=scan.c $(GRM).c err.c
- OBJ=scan.o $(GRM).o err.o
- K=number_of_tokens_of_lookahead
-
- # build an executable with same name as grammar file minus .g
- proj: $(OBJ) $(SRC)
- cc -o $(GRM) $(OBJ)
-
- # how to build parser files from grammar files
- $(GRM).c parser.dlg : $(GRM).g
- antlr $(AFLAGS) $(GRM).g
-
- # how to build scanner from lex description built by ANTLR
- scan.c : parser.dlg
- dlg -C2 parser.dlg scan.c
-
-
-
- 1.5. PCCTS component command-line options
-
- The PCCTS system is composed of ANTLR and DLG both of which have
- command-line options. Since each component is executed independently,
- each may be invoked with different options which are summarized in
- this section.
-
- ANTLR accepts the following command-line options
-
- -cr Generate cross reference of all rules. The default is FALSE.
-
- -e1 Only the first 3 tokens of any ambiguity set are displayed upon
- warning/error. i.e. { A B C ...}.
-
- -e2 Every token in an ambiguity set is displayed upon warning/error.
-
- -e3 Error levels 1 and 2 (-e1, -e2) are sometimes misleading for the
- sake of brevity. For instance, when a block is ambiguous upon
- the sequences
-
-
-
-
- A B
- C D
-
-
-
- Page 6
-
- PCCTS 1.00
-
-
- only { A C }, { B D } is displayed when in fact A D and C B are
- not ambiguous. The -e3 option makes ANTLR generate permutation
- trees in LISP-like notation
-
-
-
-
- (root child1 child2 ... childn)
-
-
-
- The following grammar illustrates the difference
-
-
-
-
- a : (A B|C D|A D|C B)
- | (A D|C B)
- ;
-
-
-
- ANTLR with error level 1 or 2 reports
-
-
-
-
- Antlr parser generator Version 1.00 1989, 1990, 1991
- "t.g", line 1: warning: alts 1 and 2 rule ambiguous upon { A C }, { B D }
-
-
-
- ANTLR with error level 3 reports
-
-
-
-
- Antlr parser generator Version 1.00 1989, 1990, 1991
- "t.g", line 4: warning: alts 1 and 2 rule ambiguous upon ( C B ) ( A D )
-
-
-
- which indicates that rule a is only ambiguous upon the two
- sequences C B and A D.
-
-
- -fe fname
- Rename the file where error information, token sets, and zzto-
- kens[] are placed. Default is err.c.
-
- -fl fname
- Rename the file where ANTLR places the lexical description.
- Default is parser.dlg.
-
-
-
-
- Page 7
-
- PCCTS 1.00
-
-
- -ft fname
- Rename the file where ANTLR places the #define's for labeled con-
- stants and the function prototypes. Default is tokens.h.
-
- -ga Generate ANSI prototypes for functions constructed from rules.
- Default is not to generate ANSI-style prototypes.
-
- -gt Generate code for Abstract-Syntax-Trees. Default is not to gen-
- erate trees.
-
- -gc Do not generate output parser code or lexical description
- (implies -gx). This option can be used to check a grammar for
- ambiguities. Default is to generate parsers.
-
- -ge Generate an error class for each non-terminal of the form
-
-
-
-
- #errclass Rule { rule }
-
-
-
- Default is not to generate the classes.
-
-
- -gs When a token in the current look-ahead needs to be compared
- against two or more tokens, code for set membership is generated
- rather than a set of integer comparisons. This renders the
- parsers more efficient, but much more difficult to read for a
- human. When debugging an ANTLR parser with a source level
- debugger, this command-line option should be used to create more
- readable parsers. The default is to generate sets for any token
- expression list with two or more members.
-
- -gd Generate debugging code; trace rule invocation. This command-
- line option forces ANTLR to generate code that calls zzTRACE()
- with the rule name in which it appears as an argument. zzTRACE()
- can be a macro or a function. The default is not to generate
- debugging code.
-
- -gl Generate line information in PCCTS-generated parser that associ-
- ates a line in the user's grammar with a line in the C code.
- This is useful when the C compiler reports an error in a user
- action. It will report an error in the grammar file rather than
- an error in the C file. This dumps line information like the C
- preprocessor.
-
- -gx Do not generate a lexical description (dlg-related files). This
- is useful if you are positive nothing has changed in the lexical
- portion of your language description. For example, if you simply
- change an action in one of your rules, nothing has changed lexi-
- cally and you need not run DLG to recreate the lexical analyzer.
- The default is to generate a description.
-
-
-
- Page 8
-
- PCCTS 1.00
-
-
- -k n
- Set k, the number of look-ahead tokens, to n. Default is 1.
-
- -p Print out the grammar to stdout without actions. This option is
- useful for documentation purposes or to simply make
- viewing/understanding your grammar easier. The default is not to
- generate a grammar printout.
-
- -pa This option is identical to the -p option accept that the grammar
- is annotated with the results of grammar analysis. For example,
- the set of all tokens that can be matched at k tokens into the
- future are displayed after each point in your grammar where a
- decision has to be made. ANTLR prints out only the sets of
- tokens that would be needed to construct a parser. Rule a below
- illustrates the annotations.
-
-
-
-
- #header <<;>>
- a : (A B | A D)
- | Q
- ;
-
-
-
- Executing ANTLR with
-
-
-
-
- antlr -k 2 -pa t.g
-
-
-
- yields (cleaned up slightly)
-
-
-
-
- Antlr parser generator Version 1.00 1989, 1990, 1991
-
- a : ( A B /* [1] { A }, { B } */
- | A D /* [2] { A }, { D } */
- ) /* [1] { A } */
- | Q /* [2] { Q } */
- ;
-
-
-
- The first alternative of rule a has a subrule with two alterna-
- tives which requires a parsing decision. Since both alternatives
- start with an A, another token of look-ahead must be examined to
- determine which alternative to choose. In this case, the second
-
-
-
- Page 9
-
- PCCTS 1.00
-
-
- token in each alternative allows us to choose. Therefore, two
- sets of tokens (singleton sets here) are printed out for each
- alternative in the subrule. The alternatives of rule a can be
- matched with the knowledge of only the first token and so only
- one set is printed. The default is not to generate a grammar
- printout with annotations.
-
- DLG has the following command-line options.
-
- -Cn
- Specify level of character class compression. 0 means no
- compression, 1 removes unused characters from DFA states, and 2
- specifies that DLG is to combine characters into equivalence
- classes. It is suggested that -C2 be used since it will result
- in much smaller output file and does not require much additional
- processing time.
-
- -ga Produce ANSI C compatible code.
-
- -m filename
- Uses filename rather than mode.h for the lexical mode definition
- file.
-
- -i Attempts to build a lexical scanner for interactive programs.
- The non-interactive scanner always looks one character past the
- end of a token. This look-ahead is avoided whenever possible in
- the interactive version so that the pattern is recognized without
- any additional characters being read. Look-ahead is always used
- when necessary to correctly match input tokens.
-
-
- 1.6. Introductory example
-
- As a simple concrete example of the PCCTS system, consider the
- following translator which accepts a date in the typical American for-
- mat of mm/dd/yy and prints to stdout the same date, but in the Euro-
- pean format of dd/mm/yy.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 10
-
- PCCTS 1.00
-
-
-
-
-
-
- #header <<#include "int.h">>
-
- <<
- main()
- {
- ANTLR(date(), stdin);
- }
- >>
-
- #token "[\t\ \n]" << zzskip(); >> /* Ignore White */
- #token Digits2 "[0-9][0-9]" /* Define mm, dd or yy */
-
- date : Digits2 "/" Digits2 "/" Digits2 "@"
- <<printf("%02d/%02d/%02d\n", $3, $1, $5);>>
- ;
-
-
-
- This example is complete in the sense that PCCTS can generate all C
- files necessary to lexically and syntactically analyze a date and con-
- vert it to the European format. An executable can be obtained by per-
- forming the following sequence of commands (assuming the above grammar
- is in file date.g and antlr.h, the standard ANTLR header, is in the
- current directory).
-
-
-
-
- antlr date.g
- dlg parser.dlg scan.c
- cc date.c scan.c err.c
-
-
-
- Executing the file a.out (on UNIX [Trademark AT&T] systems) will allow
- the user to type in a date and have the European version printed out
- (after end of file is typed).
-
- Rule date is defined to be a sequence of two digits followed by a
- / followed by a sequence of 2 digits, and so on, terminated by the end
- of the input file (represented by "@"). Note that rule names begin
- with a lower-case letter. The action
-
-
-
-
- <<printf("%02d/%02d/%02d\n", $3, $1, $5);>>
-
-
-
-
-
-
- Page 11
-
- PCCTS 1.00
-
-
- contains C code that is executed after all elements in rule date have
- been recognized. C code for the action is delimited by << and >>
- (European quotes). Because the >> delimiter otherwise would be con-
- fused with C's right-shift operator within the C code of an action,
- right-shift is written as \>\>. Similarly, the $ and # characters
- normally have special meanings within the C code of a PCCTS action;
- hence, the literal character $ in the C code would be written as \$
- and # would be written as \#.
-
- Although the placement of an action within a grammar determines
- when that action will be applied, it must also be possible for the
- action to be a function of the text which has been recognized. For
- example, PCCTS automatically makes $1 in rule date have a value, or
- attribute, which is a function of the text matched by the first
- occurrence of Digits2. In general, the attribute associated with the
- ith item in a rule is called $i. Both the type of the attributes and
- the mapping from text matched into its attribute value is specified by
- the user. The directive
-
-
-
-
- #header <<#include "int.h">>
-
-
-
- includes code that makes the values of "$ variables" be of the C type
- int with values defined by applying the C function atoi() to the text
- matched.
-
- Each item in the rules refers either to the text matched by a
- simple pattern (a regular expression) or to the collection of items
- matched by a rule (represented by the rule's name). Simple patterns
- can be stated directly, such as "/" in the example, or can be given
- names using the #token directive. The token definitions
-
-
-
-
- #token "[\t\ \n]" << zzskip(); >> /* Ignore White */
- #token Digits2 "[0-9][0-9]" /* Define mm, dd or yy */
-
-
-
- cause tab (\t), space (\ ), and newline (\n) characters between the
- other tokens to be ignored, and allow use of the name Digits2 as a
- shorthand for "[0-9][0-9]".
-
- Since the above is meant to be a stand-alone complete example and
- PCCTS does not generate a C main function, the user must provide a
- main. The main also specifies which rule is to be used to recognize
- the input and where that input will come from. The code
-
-
-
-
-
- Page 12
-
- PCCTS 1.00
-
-
-
-
-
-
- ANTLR(date(), stdin);
-
-
-
- specifies that the date rule will be used to recognize input taken
- from stdin (the standard input stream).
-
- Notice that, although there is no specification of how incorrect
- input should be handled by the generated program, PCCTS will automati-
- cally generate appropriate error messages and attempt to recover from
- the errors.
-
- The above example should give the reader a "feel" for how PCCTS
- can be used, however, the system supports much more than the example
- uses. This document is organized into six major sections: lexical
- analysis, syntactic analysis, attribute handling, error detection,
- miscellaneous items and PCCTS history.
-
- 2. Lexical analysis and token definition
-
- The task of language recognition is conveniently broken down into
- two phases - lexical and syntactic analysis. Each phase is considered
- a separate operation and, therefore, most language tools require
- separate specifications. For example, Coco-2 [DoP90] requires the
- user to create separate, but consistent, specifications of character
- sets, keywords and tokens (and even non-terminals). In contrast,
- PCCTS derives all this information from a single description. This
- both ensures that the information is consistent and makes it easier to
- view a PCCTS language specification as a whole.
-
- This section of the manual describes general lexical issues asso-
- ciated with PCCTS' integrated description including lexical action
- definitions, regular expression specification, and resolution of ambi-
- guities. Also, more specialized issues such as multiple automatons,
- interactive scanners, token positional information and input character
- supply are presented.
-
- 2.1. Token Labeling and token actions
-
- Tokens are defined either explicitly with #token or implicitly by
- using them as rule elements. Implicitly defined tokens can be either
- regular expressions (non-labeled tokens) or token names (labeled).
- Token names begin with an upper-case letter (rules begin with a
- lower-case letter). More than one occurrence of the same regular
- expression in a grammar description produces a single regular expres-
- sion in parser.dlg and is assigned one token number. Quoted and non-
- quoted tokens that refer to the same lexical object (expression) may
- be used interchangeably. Token names that are referenced, but not
- attached to a regular expression are assigned a number and given a
- #define definition in tokens.h. It is not necessary to label any
-
-
-
- Page 13
-
- PCCTS 1.00
-
-
- tokens in PCCTS. However, all tokens that you wish to explicitly
- refer to in an action, must be declared with a #token instruction.
- From PCCTS's point of view, it merely needs to know that a word is a
- token (as opposed to a non-terminal). Its value is irrelevant at the
- symbolic level of PCCTS's meta-language.
-
- The user may introduce tokens, lexical/token actions, and token
- labels via the #token instruction. Specifically,
-
-
-
- o Simply declare a token for use in a user action:
-
-
-
- #token VAR
-
-
-
- o Associate a token with a regular expression and, optionally, an
- action:
-
-
-
- #token VAR "[a-zA-Z][a-zA-Z0-9]*"
- #token Eof "@" << printf("Eof Found\n"); >>
-
-
-
- o Specify what must occur upon a regular expression:
-
-
-
- #token "[0-9]+" << printf("Found int %s\n", zzlextext); >>
-
-
-
- Token actions may employ two functions, zzreplchar() and
- zzreplstr(), to replace the text for the regular expression matched on
- the input stream. For example, if \n is found in a string for C, it
- needs to be converted to the actual newline character. Strings in C
- can be handled in the following manner.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 14
-
- PCCTS 1.00
-
-
-
-
-
-
- /* START lexclass is in effect */
- #token "\"" <<zzmode(STRINGS); zzmore();>>
-
- #lexclass STRINGS /* define a new automaton */
- #token STRING "\"" <<zzmode(START);>>
- #token "\\\"" <<zzmore();>>
- #token "\\n" <<zzreplchar('\n'); zzmore();>>
- #token "\\r" <<zzreplchar('\r'); zzmore();>>
- #token "\\t" <<zzreplchar('\t'); zzmore();>>
- #token "\\[1-9][0-9]*"
- <<zzreplchar((char)strtol(zzbegexpr,NULL,10)); zzmore();>>
- #token "\\0[0-7]*" <<zzreplchar((char)strtol(zzbegexpr,NULL,8)); zzmore();>>
- #token "\\0x[0-9a-fA-F]+"
- <<zzreplchar((char)strtol(zzbegexpr,NULL,16)); zzmore();>>
- #token "\\~[\n\r]" <<zzmore();>>
- #token "[\n\r]" <<zzline++; zzmore(); /* print warning about \n in str */>>
- #token "~[\"\n\r\\]+" <<zzmore();>>
-
- #lexclass START
-
-
-
- Note the use of zzbegexpr which points to the beginning of the text
- matched for the currently recognized regular expression. zzendexpr is
- also available and points to the end of the expression. The text
- always has a '\0' on the end and zzendexpr points to the character
- just before.
-
- There is a big difference between a token action and a normal
- action found outside of the #token instructions. An action not
- included as part of one of the lexical commands is considered a syn-
- tactic action and is placed in the C parser file associated with that
- grammar file.
-
- If a grammar action follows a #token definition with no associ-
- ated lexical action, the #token instruction will assume that the
- action is really meant as a lexical action. This is sort of analogous
- to the dangling-else-clause dilemma. Example:
-
-
-
-
- #token VAR "[a-zA-Z][a-zA-Z0-9]*"
-
- <<
- /*
- * This will be associated with the #token even though it is
- * probably a grammar action not a lexical action
- */
- >>
-
-
-
- Page 15
-
- PCCTS 1.00
-
-
- This ambiguity can be solved by appending a null action to the
- #token:
-
-
-
- #token VAR "[a-zA-Z][a-zA-Z0-9]*" << ; >>
-
-
-
-
- 2.2. Lexical actions via #lexaction
-
- Often, it is convenient to have a section of user C code placed
- in the lexical analyzer constructed automatically by DLG (PCCTS's
- DFA-based lexical analyzer generator). Normally, actions not associ-
- ated with a #token pseudo-op or embedded within a rule are placed in
- the parser generated by ANTLR. However, preceding an action appearing
- outside of any rule, with the #lexaction pseudo-op directs the C code
- to the lexical analyzer. For example,
-
-
-
-
- << /* a normal action outside of the rules */ >>
-
- #lexaction
- << /* this action is inserted into the lexical
- * analyzer created by DLG
- */
- >>
-
-
-
- All #lexaction actions are collected and placed as a group into the C
- file where the lexer resides. Typically, this code consists of func-
- tions or variable declarations needed by #token actions.
-
- 2.3. Multiple lexical classes
-
- ANTLR parsers employ DFA's (Deterministic Finite Automatons)
- created by DLG to match tokens found on the character input stream.
- More than one automaton (lexical class) may be defined in PCCTS. Mul-
- tiple scanners are useful in two ways. First, more than one grammar
- can be described within the same PCCTS input file(s). Second, multi-
- ple automatons can be used to recognize tokens that seriously conflict
- with other regular expressions within the same lexical analyzer (e.g.
- comments, quoted-strings, etc...).
-
- Actions attached to regular expressions (which are executed when
- that expression has been matched on the input stream) may switch from
- one lexical analyzer to another. Each analyzer (lex class) has a
- label which is used to enter that automaton. A predefined lexical
- class called START is in effect from the beginning of the PCCTS
- description until the user issues a #lexclass directive or the end of
-
-
-
- Page 16
-
- PCCTS 1.00
-
-
- the description is found.
-
- When more than one lexical class is defined, it is possible to
- have the same regular expression and the same token label defined in
- multiple automatons. Regular expressions found in more than one auto-
- maton are given different token numbers, but token labels are unique
- across lexical class boundaries. For instance,
-
-
-
-
- #lexclass A
- #token LABEL "expr1"
-
- #lexclass B
- #token LABEL "expr2"
-
-
-
- In this case, LABEL is the same token number (#define in C) for both
- "expr1" and "expr2". A reference to LABEL within a rule can be
- matched by two different regular expressions depending on which auto-
- maton is currently active.
-
- Hence, the #lexclass directive marks the start of a new set of
- lexical definitions. Rules found after a #lexclass can only use
- tokens defined within that class - i.e. all tokens defined until the
- next #lexclass or the end of the PCCTS description, whichever comes
- first. Any regular expressions used explicity in these rules are
- placed into the current lexical class. Since the default automaton,
- START, is active upon parser startup, the start rule must be defined
- within the boundaries of the START automaton. Typically, a multiple-
- automaton PCCTS grammar will begin with
-
-
-
-
- #lexclass START
-
-
-
- immediately before the rule definitions to insure that the rules use
- the token definitions in the "main" automaton.
-
- Tokens are given sequential token numbers across all lexical
- classes so that no conflicts arise. This also allows the user to
- reference zztokens[token_num] (which is a string representing the
- label or regular expression defined in the grammar) regardless of
- which class token_num is defined in.
-
- 2.3.1. Multiple grammars, multiple lexical analyzers
-
- Different grammars will generally require separate lexical
- analyzers to break up the input stream into tokens. What may be a
-
-
-
- Page 17
-
- PCCTS 1.00
-
-
- keyword in one language, may be a simple variable in another. The
- #lexclass PCCTS meta-op is used to group tokens into different lexical
- analyzers. For example, to separate two grammars into two lexical
- classes,
-
-
-
-
- #lexclass GRAMMAR1
-
- rules for grammar1
-
- #lexclass GRAMMAR2
-
- rules for grammar2
-
-
-
- All tokens found beyond the #lexclass meta-op will be considered of
- that class.
-
- 2.3.2. Single grammar, multiple lexical analyzers
-
- Again the #lexclass meta-op is used to indicate the beginning of
- a new lexical analyzer. For instance to define C-style comments, the
- following will suffice:
-
-
-
-
- /* lexclass START is in effect by default */
- /* switch to COMMENT automaton */
- #token "/\*" <<zzmode(COMMENT); zzskip();>>
- ...
- #lexclass COMMENT
- #token "\*/" <<zzmode(START); zzskip();>> /* switch back */
- #token "~[\*]*" <<zzskip();>> /* ignore all stuff inside */
- #token "\*~[/]" <<zzskip();>>
- ...
-
-
-
- 2.4. Handling end of input
-
- There is only one automatically defined regular expression -
- end-of-file which is represented by the regular expression @. An
- implicit
-
-
-
-
- #token "@"
-
-
-
-
-
- Page 18
-
- PCCTS 1.00
-
-
- is executed at the start of each lexical class (#lexclass). A label
- may be attached via the following in the user's grammar:
-
-
-
-
- #token Eof "@"
-
-
-
- A user could then refer to end-of-file as the #define constant Eof.
-
- 2.5. Token order and lexical ambiguities
-
- The order in which regular expressions are found in the grammar
- description file(s) is significant. When the input stream contains a
- sequence of characters that match more than one regular expression,
- (e.g. one regular expression is a subset of another) DLG is confronted
- with a dilemma. DLG does not know which regular expression to match,
- hence, it does not know which action should be performed. To resolve
- the ambiguity, DLG assumes that the regular expression which is
- defined earliest in the DLG input should take precedence over later
- definitions. Therefore, tokens that are special cases of other regu-
- lar expressions should be defined before the more general regular
- expressions.
-
- ANTLR automatically constructs the input to DLG by copying regu-
- lar expressions into the file parser.dlg. These definitions are
- copied in the order in which they are encountered among the rules and
- #token definitions. For example, a keyword is a special case of a
- variable and thus needs to occur before the variable definition.
-
-
-
-
- #token KBegin "begin"
- .
- .
- .
- #token VAR "[a-zA-Z][a-zA-Z0-9]*"
-
-
-
-
- 2.6. Quoted tokens
-
- The regular expressions in a PCCTS grammar appear between quote
- marks and are literally copied into the input for DLG. Therefore, the
- regular expressions must use the notation accepted by DLG - a rela-
- tively standard notation using a few meta-symbols.
-
- "a|b"Matches either the pattern a or the pattern b.
-
- "(a)"Matches the pattern a. Pattern a is kept as an indivisible unit.
-
-
-
- Page 19
-
- PCCTS 1.00
-
-
- "{a}"Matches a or nothing, i.e., the same as "(a|)".
-
- "[a]"Matches any single character in character list a. e.g. "[abc]"
- matches either an a, b or c and is equivalent to "(a|b|c)".
-
- "[a-b]"
- Matches any of the single characters whose ASCII codes are
- between a and b inclusively, i.e., the same as "(a|... |b)".
-
- "~[a]"
- Matches any single character except for those in character list
- a.
-
- "~[]"
- Matches any single character; literally "not nothing."
-
- "a*" Matches zero or more occurrences of pattern a.
-
- "a+" Matches one or more occurrences of pattern a, i.e., the same as
- "aa*".
-
- "\a" Matches the single character a - even if a by itself would have a
- different meaning, e.g., "\+" would match the + character. This
- is consistent with the use of \ in the C language.
-
- In addition, the symbols %%, <<, and >> are used to specify control
- for DLG; hence, these patterns must be escaped if they are to appear
- as literals within a regular expression. For example, << would be
- \<\<. The following characters are also special.
-
- @ Matches end-of-file.
-
- \t Tab character
-
- \n Newline character
-
- \r Carriage return character
-
- \b Backspace character
-
- \0nnnMatches character that has octal value nnn
-
- \0xnnMatches character that has hexadecimal value nnn
-
- \mnn
- Matches character with decimal value mnn, 1<m<9
-
- \c Character escape
-
- White space characters (tab and space) are ignored within the quota-
- tion marks. To specify the actual space character use "\ ".
-
-
-
-
-
-
- Page 20
-
- PCCTS 1.00
-
-
- 2.7. Interactive lexical analyzers
-
- Normally, the scanners produced by DLG always have one character
- of look-ahead available. This look-ahead causes the scanner to look
- one character past the end of the current pattern and wait for the
- next character if it is not yet available.
-
- This can cause unexpected behavior from interactive programs such
- as command line interpreters which must respond to input upon newline.
- Normally, the scanner would seek a character beyond the newline; forc-
- ing the user to type an additional character before returning the car-
- riage return token to the parser.
-
- The user may employ the DLG -i command-line option to request
- that unneeded look-ahead be removed whenever possible. This allows
- interactive parsers such as the interpreter mentioned above to get
- tokens without the user having to type additional characters. On some
- systems special flags need to be set, so that the scanner actually
- gets the characters when they are received rather than having the
- operating system buffer them. On BSD unix the cbreak mode needs to be
- used to accomplish this (e.g. stty cbreak).
-
- There are situations where the character look-ahead is still
- required. Typically, this occurs when two patterns have the same pre-
- fix. For example, patterns A and Ab can only be distinguished by exa-
- mining the character following pattern A. A b character would indi-
- cate that the token associated with the second pattern should be
- returned. Also, any pattern that has a closure, + or *, must have
- look-ahead to determine the end of the pattern, since the pattern by
- definition can be repeated indefinitely.
-
- 2.8. DLG lexical input
-
- The user can specify that DLG is to take its input from a stream
- or from a function. The function must behave like getchar() by
- returning a new character per invocation. The function must return -1
- when no more characters are available to signal "end-of-file."
- Presumably, the function would read characters from some buffer and
- hand them one by one to DLG upon request. For example, a user might
- define the following.
-
-
-
-
- char text_edit_buf[BIG];
-
- int nextchar()
- {
- static char *p = &text_edit_buf[0];
- return (p < &text_edit_buf[BIG]) ? *(p++) : -1;
- }
-
-
-
-
-
-
- Page 21
-
- PCCTS 1.00
-
-
- Then, when invoking the parser, the user would call:
-
-
-
-
- {
- ...
- ANTLRf(start_symbol(), nextchar);
- ...
- }
-
-
-
- 2.9. Tracking pattern position
-
- There are several global variables available in the scanners pro-
- duced by DLG to keep track of the position of a pattern within an
- input stream. zzline tracks vertical position and must be explicitly
- updated by the user's grammar. zzbegcol and zzendcol track horizontal
- position-column number within a line. They are maintained by the DLG
- automaton for characters that are normally one character wide.
- zzendcol needs to be corrected for cases when the character is not one
- character wide, e.g. tabs or carriage return.
-
- To enable the position tracking, a #define ZZCOL needs to be put
- in an action in the lexical specification or -DZZCOL needs to be
- included in the command line when compiling the C code for the lexical
- analyzer.
-
- This section described how a stream of input characters can be
- collected into a sequence of tokens and how one can act upon the
- recognition of particular tokens. Sentences in a language are com-
- posed of tokens and are described using a set of rules. The next sec-
- tion presents the PCCTS meta-language for describing language syntax.
-
- 3. Syntactic analysis
-
- Once the lexical structure of a language has been specified, the
- way in which tokens can be combined to form sentences must be
- described. The set of valid sentences in a language is described by a
- set of rules, written in an appropriate meta-language, that impose a
- structure on the sequence of input tokens.
-
- In most parser-generator tools, context-free grammars are speci-
- fied in a notation which is equivalent to BNF [Nau63]. However, the
- notation used in regular expressions to permit repetition, alterna-
- tives, etc. can be used to extend BNF (EBNF) so that context-free
- grammars can be expressed more concisely. This also has the advantage
- of being more consistent with the regular expression notation and pro-
- viding additional information which can be used to create a more effi-
- cient parser. For these reasons, PCCTS accepts an EBNF-like grammar
- notation.
-
- The input for PCCTS differs from EBNF primarily in that ::= is
-
-
-
- Page 22
-
- PCCTS 1.00
-
-
- replaced by : and non-terminals are distinguished from terminals by
- requiring all non-terminals to begin with a lowercase letter, rather
- than by bracketing non-terminals with <>. Further, since rules can
- become long when actions are embedded, each rule is terminated by a ;
- symbol. These modifications are not arbitrary, but mirror the differ-
- ences between YACC grammars and BNF (although PCCTS provides a number
- of extensions beyond YACC).
-
- As an example rule, consider:
-
-
-
-
- block: "begin" (statement)* "end" ;
-
-
-
- It indicates that rule a is defined to be the word "begin", followed
- by zero or more statement's, followed by the word "end".
-
- This section describes the syntax of PCCTS rules and the method
- by which rules communicate at run-time.
-
- 3.1. PCCTS rule definitions
-
- A rule is formally defined, in PCCTS's own notation, as:
-
-
-
-
- rule : NonTerminal /* rule name */
- { "!" } /* turn off automatic AST construction */
- { {"\<"} InheritAction } /* downward-inheritance ("input") */
- { "\>" InheritAction } /* upward-inheritance ("output") */
- { QuotedTerm } /* name for error reporting */
- ":" block ";" /* actual rule elements */
- { Action } /* fail action */
- ;
-
- block : alt ("\|" alt)*
- ;
-
- alt : ( element )*
- ;
-
- element : TokenTerm { "^" | "!" }
- | QuotedTerm { "^" | "!" }
- | NonTerminal { "!" }
- { {"\<"} InheritAction } { "\>" InheritAction }
- | Action
- | "\(" block "\)" { "\*" | "\+" } { InheritAction }
- | "\{" block "\}" { InheritAction }
- ;
-
-
-
-
- Page 23
-
- PCCTS 1.00
-
-
- where InheritAction is a C expression enclosed in [ ] used for both
- upward- and downward-inheritance. QuotedTerm is a regular expression
- and TokenTerm is a label associated with some lexeme.
-
- 3.1.1. Subrules
-
- Rules that employ EBNF constructs implicitly define subrules
- according to the following syntax:
-
- (i) b =(a | a | ... | a ); match 1 time
- i 1 2 m
- [Figure omitted]
-
- (ii) b ={a | a | ... | a }; match 0 or 1 times
- i 1 2 m
- [Figure omitted]
-
- (iii)b =(a | a | ... | a )*; match 0 or more times
- i 1 2 m
- [Figure omitted]
-
- (iv) b =(a | a | ... | a )+; match 1 or more times
- i 1 2 m
- [Figure omitted]
-
-
- where a represents an alternative (list of rule elements).
- i
- With each enclosing set of meta-symbols, a new level of alterna-
- tives is permitted. Each rule or subrule thus created is called a
- "block." In terms of the recognizer generated, blocks generated for
- rules and subrules are virtually indistinguishable. Subrules are like
- macro expansions of rules. However, there is no label associated with
- a subrule and thus cannot be referred to by another rule or from else-
- where within that rule.
-
- Consider the following rule which uses EBNF constructs:
-
-
-
-
- e12 : {"\+" | "\-"} VAR
- | (FuncName | ProcName) parameterList
- | "[0-9]+"
- ;
-
-
-
- Rule e12 has 3 alternatives. The first alternative has an
- optional block with 2 alternatives as its first element. The { } is
- an optional subrule. VAR is a simple token. The optional block says
- that a VAR can have a plus or minus in front, but it is not necessary.
- The second alternative of rule a also begins with a subrule. The
- block indicates that this alternative must begin with a function or
-
-
-
- Page 24
-
- PCCTS 1.00
-
-
- procedure name. parameterList is some other rule which presumably
- looks for a list of parameters. Alternative 3 is simply a token which
- recognizes an integer number.
-
- Subrule meta-symbols alter the associativity of the | alternative
- operator and can be nested to any number of levels. |'s have the
- lowest precedence.
-
-
-
-
- a: B | C D ;
-
-
-
- Rule a specifies that the parser should match either B or C D.
- It will not match a D preceded by a B or C. The following version of
- a will accomplish that:
-
-
-
-
- a: (B | C) D;
-
-
-
- Empty alternatives called epsilon transfers can also be used to
- imply optionality. They are useful when you want an action to be per-
- formed when nothing is matched by the other alternatives of the block.
-
-
-
-
- a: B | C | ;
- b: { B | C } ;
-
-
-
- Rules a and b are identical in what they match. However, the
- resulting C code could be very different. Also note that with rule a,
- an action can placed in the empty alternative to be executed when a B
- or C is not matched. Rule b does not have this capability.
-
- Subrules are allowed at most one parameter which must be of type
- Attrib. The attribute $0 receives the inherited parameter for
- subrules. Subrules return values in $0 as well (which become the
- appropriate $i in the enclosing subrule/scope above). $0 is initially
- undefined in rules and in subrules with no explicit parameter or
- default $0 initialization. See section 3.1.2.
-
- 3.1.2. Rule communication
-
- An advantage of LL parsers over LR parsers is that inheritance
- (rule/subrule communication) is relatively simple and efficient.
-
-
-
- Page 25
-
- PCCTS 1.00
-
-
- PCCTS allows each non-terminal to have an arbitrary set of downward-
- inherited (parameters) and upward-inherited (return value) attributes
- (see section 4.1.6). The specification syntax is presented in this
- section whereas the use of rule communication is illustrated in sec-
- tion 4.1.6.2:
-
-
-
-
- rule[parameters] > [return_type(s)] : X | X | ... | X ;
- 1 2 m
-
-
- which provide more powerful attribute handling than can be achieved
- with the "$ variables" found in systems like YACC (and also supported
- in PCCTS, primarily to ease the transition for users familiar with
- YACC and to communicate with the lexical analyzer).
-
- Communicating with other rules should be as flexible as it is in
- a programming language. Hence, in addition to the standard attribute
- mechanism, PCCTS supports the familiar high-level language parameter
- passing scheme. The arguments to a rule must be consistent with the
- parameter specifications given with the definition of the rule. For
- example,
-
-
-
-
- <<
- #define GLOBAL 1
- #define LOCAL 2
- >>
-
- prog : <<Sym *vars;>>
- TYPE dlist[GLOBAL] > [vars]
- <<OperateOn(vars);>>
- ...
- ;
-
- /* Recognize a list of declarations. The scoping level is passed
- * in and the list of symbols is returned.
- */
- dlist[int level] > [Sym *list]
- : <<$list=NULL;>>
- VAR <<add($list, $1, $level);>>
- ( ","
- VAR <<add($list, $2, $level);>>
- )*
- ;
-
-
-
- where add() is some function that adds an element to a list. Here we
- have used the attributes only as a means of communicating with the
-
-
-
- Page 26
-
- PCCTS 1.00
-
-
- lexical analyzer.
-
- In contrast to most programming languages, more than one value
- can be returned. Each return value is set by simply referring to it
- by the name given in the return value definition (prefixed with $) -
- e.g. $list in the dlist example.
-
- Subrules may not have return types nor parameter list defini-
- tions. They may only pass a single attribute which becomes the ini-
- tial value of $0 in the subrule.[1]
-
- To illustrate the effect of rule communications upon local vari-
- ables and scoping, consider:
-
-
-
-
- rule[int a, Attrib b] > [float q, int r] /* a,b visible to rule */
- : <<int i,j; printf("starting rule a\n");>> /* i,j visible to rule */
- ...
- ( <<int i;>> /* i hides rule-local i */
- ...
- ( <<int i,r;>> /* def hides outer scope i; local r can't be used */
- <<$r=5;>> /* $r is return value not local var */
- )
- )
- ;
-
-
-
- Note that return values and parameters are always given priority over
- local variables. While generating C code for a rule, ANTLR scans the
- actions and translates references to return values and parameters
- irrespective of scope.
-
- 3.1.3. Miscellaneous notes
-
- The user may attach a string to a non-terminal that is passed on
- to the error reporting macro/function zzsyn(). See section 5 on error
- reporting for more details.
-
- The user may elect to have PCCTS automatically construct
- abstract-syntax-trees (AST's). To disable automatic AST construction
- for a rule, employ the ! as indicated above in the rule grammar. See
- section 4.2 on abstract-syntax-trees.
-
-
-
- _________________________
- [1] An earlier version of PCCTS treated subrules and
- rules identically with respect to parameters and return
- values. The current distinction represents an
- extension to the handling of rules.
-
-
-
-
- Page 27
-
- PCCTS 1.00
-
-
- 3.1.4. LL(k) parsing
-
- ANTLR attempts to build parsers that use as little look-ahead as
- possible. Ambiguities force ANTLR to attempt resolution using more
- and more tokens of look-ahead. For example, ANTLR tries to resolve
- LL(1) ambiguities with LL(2). If the grammar construct is still ambi-
- guous, LL(k>2) is tried until either all ambiguities are resolved or
- the user-defined maximum k has been reached. A subrule or rule which
- requires LL(k>1) does not constrain all others to LL(k>1). ANTLR
- builds mixed model parsers where each parsing decision is made with
- the smallest of amount of look-ahead possible. The -k command-line
- option is used to bound the maximum number of tokens of look-ahead to
- be used by ANTLR-generated parsers.
-
- Most programming language constructs that cannot be described in
- LL(1) can be easily described in LL(2). Typically, LL(2) is only
- required in a handful of rules yielding a general LL(1) parser with a
- few anomalies. For example, when parsing the C programming language,
- a left-parenthesis could be the beginning of a type-cast or the begin-
- ning of an expression. With two tokens of look-ahead a parser could
- use the token following the parenthesis to determine whether an
- expression or type-cast followed. A more common problem lies in the
- definition of labels. The following grammar fragment illustrates the
- problem.
-
-
-
-
- stat : { WORD ":" } /* label definition */
- (
- | "if" ...
- | "while" ...
- .
- .
- | WORD "\(" ... /* forward ref to function */
- | FUNC "\(" ... /* ref to function already seen */
- )
- ;
-
-
-
- With only one token of look-ahead, function stat() could not determine
- whether a WORD on the input stream was to be matched as a label or a
- yet to be seen function. However, if two tokens or more were avail-
- able, the colon or left-parenthesis following WORD would unambiguously
- indicate which alternative to match.
-
- 3.2. Grammar ambiguities
-
- PCCTS parsers are of the top-down, recursive descent variety
- which implies that all parsing decisions must be left-decidable-
- decidable given a finite sequence of input tokens looking from left to
- right across productions. Because of this limitation, there are gram-
- mars for which PCCTS cannot construct a parser. However, any language
-
-
-
- Page 28
-
- PCCTS 1.00
-
-
- can be described by several, equivalent grammars. Ambiguities that
- arise in a grammar may often be circumvented by modifying that grammar
- while still recognizing the same language.
-
- A grammar is considered ambiguous (in PCCTS's opinion) if one
- input sequence of k tokens can be matched by two different
- productions/alternatives within the same block. For example, if k
- equals one,
-
-
-
-
- a : A B
- | A C
- ;
-
-
-
- is ambiguous because if all the parser "sees" is A, it will be unable
- to choose between alternatives one and two. ANTLR reports the follow-
- ing error message:
-
-
-
-
- Antlr parser generator Version 1.00 1989, 1990, 1991
- "t.g", line 1: warning: alts 1 and 2 ambiguous upon { A }
-
-
-
- If we allow two tokens of look-ahead (with -k 2), a parser would see
- either A B or A C and could uniquely determine which production to
- match. If subrules and/or rule references are used, the set of possi-
- ble input permutations grows quickly.
-
-
-
-
- a : A (B | D) E F H
- | A b E G H
- ;
-
- b : B
- ;
-
-
-
- Rule a is ambiguous when k equals one or two since A B can start
- either production. If we increase k to three, rule a is still ambigu-
- ous because A B E begins both productions. Only when k is set to four
- is a unambiguous. ANTLR applies a similar logic when constructing
- parsers. When an ambiguity occurs at k==i, k=i+1 is tried until
- either the ambiguity is resolved or the maximum look-ahead defined by
- the user is reached (in which case the ambiguity is reported to the
-
-
-
- Page 29
-
- PCCTS 1.00
-
-
- user). ANTLR reports the following for k==2,
-
-
-
-
- Antlr parser generator Version 1.00 1989, 1990, 1991
- "t.g", line 1: warning: alts 1 and 2 ambiguous upon { A }, { B }
-
-
-
- Note the sequence A B is denoted by the singleton set { A } followed
- by the singleton set { B }. In general, ambiguous input sequences of
- k tokens are represented by a sequence of k sets.
-
- The -pa command-line option can be used to print out the user's
- grammar (minus actions) annotated with the input token permutations.
- The set sequence { A B }, { C D } means that the following four input
- token permutations are possible:
-
-
-
-
- A C
- A D
- B C
- B D
-
-
-
- 3.3. Look-ahead size
-
- The user's grammar can be analyzed using any integer k value
- (limited by machine size and how long the user wants to wait). How-
- ever, parsers constructed by ANTLR always use k values equal to the
- nearest power of two greater than or equal to the k specified by the
- user. ANTLR parsers treat the look-ahead as a circular buffer in
- which a "modulo k" operation is performed repeatedly. If k is a power
- of two, this operation reduces to a logical-and operation whereas
- other values generally require a divide - i.e. a time consuming opera-
- tion. Since the modulo operation is performed at least once for each
- input token, a divide operation would seriously degrade parser perfor-
- mance. The #define constant LL_K is set by ANTLR to the k associated
- with parser execution. Note that files not constructed by ANTLR will
- need to define LL_K if they include antlr.h. This can be handled with
- a "-DLL_K=k" command-line option on most C compilers.
-
- 3.4. ANTLR parser construction
-
- ANTLR generates parsers in pure C code--i.e. non-interpretive,
- mutually-recursive sets of functions. There is exactly one C function
- for every rule present in an ANTLR grammar description. Transforma-
- tions exist to convert grammar constructs to C language constructs.
- To perform lexical analysis, ANTLR prepares a DLG input file that will
- be converted to a C scanner by DLG and then linked into the parser.
-
-
-
- Page 30
-
- PCCTS 1.00
-
-
- Because of ANTLR's flexibility, C code generation templates vary
- depending on command-line options and #define's [Par90]. ANTLR
- parsers are fast primarily because they never have to back up and map
- EBNF grammar constructs to efficient target language constructs. The
- current k tokens of look-ahead always tells the parser which alterna-
- tive to choose. Also, C's efficient function invocation mechanism
- allows references to other rules to be handled quickly.
-
- 3.4.1. Efficiency
-
- The way grammars are described has a big impact on phrase recog-
- nition speed. Rules invoking other rules with tail-recursion (right-
- recursion) like
-
-
-
-
- /* S l o w A n d U g l y */
- a: B c ;
- c: a | ;
-
-
-
- can be replaced with
-
-
-
-
- /* F a s t e r B u t S t i l l U g l y */
- a: B (a | ) ;
-
-
-
- or, better yet,
-
-
-
-
- /* F a s t e s t A n d R e a d a b l e */
- a: B (B)* ;
-
-
-
- which is more concise and infinitely more readable.
-
- This example leads us to two important generalizations:
-
-
-
- o Subrules are much faster than invoking another rule to match the
- same subgrammar.
-
- o Use the ( )* and ( )+ closure operators instead of tail recur-
- sion. This generates a while loop instead of recursive function
-
-
-
- Page 31
-
- PCCTS 1.00
-
-
- calls.
-
-
- Rules or subrules with many alternatives tend to generate bigger
- and slower programs. This arises from the fact that a linear search
- implemented by a sequence of if statements is used to determine which
- alternative to choose. Although general LL(k) parsing makes it diffi-
- cult to optimize this search for the correct alternative, future ver-
- sions of PCCTS will probably make a number of special-case improve-
- ments, such as generating a switch statement for each alternative
- selection problem which is LL(1).
-
- In many parser-generators, such as YACC, adding actions to a
- grammar can cause rules to be split, introducing ambiguities and/or
- extra states in the generated parser. In contrast, actions have no
- affect on the structure of the recognizers constructed by PCCTS.
- PCCTS simply generates C code templates that implement the recognizer
- and embeds C code for the user-supplied actions at the proper points.
- User actions have no significant affect on parsing speed.
-
- 3.4.2. Debugging parsers
-
- PCCTS parsers have one function for each rule. The function name
- will coincide with the rule it was constructed from and therefore can
- be referenced from within a source-level debugger. Breakpoints can be
- set so that parser execution stops before or after certain grammar
- fragments of interest have been recognized. PCCTS parsers are easy to
- follow because parsing decisions are made using simple if and while
- statements.
-
- Often it is convenient to track the sequence of rule invocations
- made in an PCCTS parser. The -gd command-line option forces PCCTS to
- generate code that calls zzTRACE() with the rule name in which it
- appears as an argument. zzTRACE() can be a macro or a function.
-
- 3.5. PCCTS parser template
-
- The following code template can be used to begin parser develop-
- ment.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 32
-
- PCCTS 1.00
-
-
-
-
-
-
- /*
- * A N T L R P a r s e r T e m p l a t e
- *
- * Terence Parr, Hank Dietz and Will Cohen
- * CARP Research Group
- * Purdue University Electrical Engineering
- * PCCTS Version 1.00
- */
- #header << /* Define Attrib/AST or include standard package */ >>
-
- /* O p t i o n a l */
- #token "[\t\ ]+" << zzskip(); >> /* Ignore White */
- #token "\n" << zzline++; zzskip(); >> /* Track Line # */
-
-
-
-
-
-
- <<
- main() { ANTLR(startSymbol(), stdin); }
-
- zzcr_attr(attr, token, text)
- Attrib *attr;
- int token;
- char *text;
- {
- /* *attr = Function(text, token) */
- }
-
- zzd_attr(attr) Attrib *attr; { /* destroy attribute */ }
-
- zzcr_ast(ast, attr, tok, text)
- AST *ast;
- Attrib *attr;
- int tok;
- char *text;
- {
- /* *ast = Function(attr) */
- }
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 33
-
- PCCTS 1.00
-
-
-
-
-
-
- /* node constructor for use with #[] */
- AST zzmk_ast(node, user_args)
- AST *node; /* newly created node */
- {
- /* fill in node->fields */
- return node;
- }
-
- zzd_ast(ast) AST *ast; { /* destroy AST node pointed to by 'ast' */ }
-
- #define zzdef0(zero) /* *(zero) = initial $0 attrib */
- >>
-
- /* P u t R u l e s H e r e */
-
-
-
- 3.6. Grammatical actions
-
- Actions are blocks of C code enclosed in << and >>. They are
- placed in the ANTLR-generated C files according to position in the
- corresponding grammar file. In general, an action following a token
- or non-terminal reference is executed after that grammar element has
- been recognized. There is no restriction concerning their placement
- within rules. However, actions cannot be specified between rules.
- They may only be used before and after the set of rules.
-
- 3.6.1. Init Actions
-
- If an action is specified before any grammar element is found
- after the : in a rule definition or after the start of a subrule, it
- is placed before any ANTLR generated C code for that rule or subrule.
- This allows the user to define variables and execute C code in a scope
- local to that rule or subrule; any C code given is executed just
- before the rule or subrule is applied. For example, consider the
- definition of a rule a:
-
-
-
-
- a : <<List *p=NULL;>> /* Get a type followed by a list of vars */
- Type
- ( <<int i=0;>>
- Var <<append(p, i++, $1);>>
- )*
- <<OperateOn(p);>>
- ;
-
-
-
-
-
-
- Page 34
-
- PCCTS 1.00
-
-
- The init action <<List *p=NULL;>> defines a local auto variable called
- p and initializes it to NULL. The subrule also has an init action,
- <<int i=0;>>, which is executed every time the subrule is entered.
-
- The init actions of optional blocks are executed whether or not
- the elements in the block are recognized. This is due to the fact
- that subrules (C code block) must be entered in order to determine
- whether or not that grammar construct can be matched on the input
- stream. This can be confusing, but on the other hand, we can force a
- variable to have a value even if nothing is matched in an optional
- block. For example,
-
-
-
-
- a : <<Sym *p;>>
- ...
- { <<p=NULL;>> ":" Type <<p=$2;>> }
- <<OperateOn(p);>>
- ...
- ;
-
-
-
- The init-actions of closure blocks (( )* subrules) are executed
- only once - upon entry to the block. This allows local variable ini-
- tialization within closure blocks.
-
- Only $0 can be referenced in an init-action since nothing will
- have been matched when the action executes. Default $0 initialization
- occurs before init-actions are executed.
-
-
- 3.6.2. Fail actions
-
- Fail actions are actions that are placed immediately following
- the ; rule terminator. They are executed after a syntax error has
- been detected but before a message is printed and the attributes have
- been destroyed (optionally with zzd_attr(); see the section on attri-
- butes). However, attributes are not valid here because one does not
- know at what point the error occurred and which attributes even exist.
- Fail actions are often useful for cleaning up data structures or free-
- ing memory. For example,
-
-
-
-
- a : <<List *p=NULL;>>
- ( Var <<append(p, $1);>> )+
- <<OperateOn(p); rmlist(p);>>
- ;
- <<rmlist(p);>>
-
-
-
-
-
- Page 35
-
- PCCTS 1.00
-
-
- The ( )+ loop matches a lists of variables (Var's) and collections
- them in a list. The fail-action <<rmlist(p);>> specifies that if and
- when a syntax error occurs, the elements are to be free()'ed.
-
- 3.6.3. Actions appearing outside of rules
-
- When PCCTS is used to construct a complete compiler, it is useful
- to be able to incorporate support code written in C in the same file.
- For example, there must always be a main specified and the ANTLR-
- generated parser must be invoked using the ANTLR macro. Hence, PCCTS
- allows arbitrary C code to appear as "actions" outside of the grammar
- rules. The PCCTS description
-
-
-
-
- <<
- #include "junk"
- static int local_to_the_parser;
- main() {ANTLR(a(), stdin);}
- >>
-
- a: a_stuff ;
- b: b_stuff ;
-
- << hello() { printf("Hello?"\n"); } >>
-
-
-
- will result in the following output C code:
-
-
-
-
- #include "junk"
- static int local_to_the_parser;
- main() {ANTLR(a(), stdin);}
-
- a()
- {
- ... code to recognize a_stuff ...
- }
-
- b()
- {
- ... code to recognize b_stuff ...
- }
-
- hello() { printf("Hello?"\n"); }
-
-
-
- Note that although these segments of C code look like "actions,"
- and the same rules apply concerning the use of escape characters, they
-
-
-
- Page 36
-
- PCCTS 1.00
-
-
- are not actions per se. Hence, no attribute values are accessible.
-
- 4. Attribute handling
-
- Given a grammar, PCCTS constructs a C program that recognizes
- phrases in the language described by that grammar. No translation
- from source language to output language is performed; the resulting
- parser merely complains if a syntax error is detected. To effect a
- transformation, actions must be embedded within the grammar that
- either perform an immediate translation or create an intermediate-form
- which is translated at a later time. Grammar actions have access to
- lexical information via run-time objects called attributes which are a
- function of the token number and the text of the lexeme found on the
- input stream. PCCTS also supports the creation and manipulation of
- abstract-syntax-trees (AST's) to handle more complicated transforma-
- tions. This section describes the use of attributes, attribute inher-
- itance as well as other forms of rule communication, and the handling
- of AST's.
-
- 4.1. General attribute handling
-
- Attributes are objects that are associated with all rules and
- rule elements including block elements like { }. Attributes are gen-
- erally identified by $i (where i is a positive integer) and are used
- in actions to identify the values of rule elements. Attributes behave
- like variables. They are mentioned and labeled at analysis-time, but
- have values only at run-time. The positive integers uniquely identify
- an element within the currently active block and within the current
- alternative of that block. With the invocation of each new block, a
- new set of attributes becomes active and the previously active set is
- temporarily inactive (unless $i.j, $$ or $inheritance_attr attributes
- are used; see sections 4.1.4). Attributes are scoped exactly like
- local stack-based variables in C.
-
- $-variables refer to user-defined objects associated with each
- rule element. The zzcr_attr() function transforms lexical objects
- into grammar attributes. Attributes associated with terminals are
- implicitly created and are a function of lexemes. However, attributes
- of non-terminals and subrules are explicitly manipulated by actions
- provided by the user. Terminals, which are lexical tokens, simply
- have a value associated with them and are generally referenced not
- reset by actions. Attributes that are returned from or passed to
- blocks are considered synthetic attributes. Rules and subrules may
- communicate via attributes or with familiar high-level language param-
- eter passing techniques (see sections 3.1.2 and 4.1.6).
-
- PCCTS requires that the user define the data type or structure of
- the attributes as well as specify how to convert from lexemes to
- attributes. This provides significant control over what information
- attributes carry and how the lexical analyzer should react to lexical
- objects found on the input stream. The type of the attributes is
- always defined by a C typedef named Attrib and must be defined before
- antlr.h is included in your C program. Using the PCCTS #header
- instruction, the user may specify the C definition or #include the
-
-
-
- Page 37
-
- PCCTS 1.00
-
-
- file needed to define Attrib. The action associated with #header is
- placed in every C file generated from the input grammar files. Any C
- file created by the user that includes antlr.h must once again define
- Attrib before #include'ing antlr.h.
-
- Attributes are stored and accessed in stack fashion. With the
- recognition of each element in a rule, a new attribute is pushed on
- the stack-i.e. becomes available. In other words, the ith attribute
- is not available until the ith rule element has been matched. If a
- rule element is something other than a simple token, many attributes
- may be pushed on the attribute stack until they reduce to a single
- attribute for that rule reference or subrule.
-
- For example, if attributes are to be simple integers, the follow-
- ing C definition is sufficient and necessary
-
-
-
-
- typedef int Attrib;.
-
-
-
- However, this does not specify how those integer attributes are con-
- verted from character strings in the input stream.
-
- 4.1.1. Attribute creation
-
- The attributes associated with terminals must be a function of
- the text and the token number associated with that lexeme. These
- values are passed to zzcr_attr() which computes the attribute to be
- associated with that lexeme. The user must define a function or macro
- that has the following form:
-
-
-
-
- zzcr_attr(attr, token, text)
- Attrib *attr; /* Pointer to attribute associated with this lexeme */
- int token;
- char *text; /* text associated with lexeme */
- {
- /* *attr = F(text,token); */
- }
-
-
-
- Consider the following Attrib and zzcr_attr() definition.
-
-
-
-
-
-
-
-
-
- Page 38
-
- PCCTS 1.00
-
-
-
-
-
-
- typedef union {int ival; float fval;} Attrib;
-
- zzcr_attr(attr, token, text)
- Attrib *attr;
- int token;
- char *text;
- {
- switch ( token )
- {
- case INT : attr->ival = atoi(text); break;
- case FLOAT : attr->fval = atof(text); break;
- }
- }
-
-
-
- The typedef specifies that attributes are integer or floating point
- values. When the regular expression for a floating point number
- (which has been labeled FLOAT) is matched on the input stream,
- zzcr_attr() converts the string of characters representing that number
- to a C float.
-
- The next attribute example is more involved. Attributes are
- either symbol table entries or integer values. Variables (VAR),
- intrinsic types (BuiltInType) and parameters (PARAMETER) found on the
- input stream are converted to symbol table pointers. Integers are
- converted to C int's.
-
-
-
-
- /* Assume Sym, zzsymget() are defined elsewhere */
- #header <<typedef union { Sym *var; int val; } Attrib;>>
-
- <<
- Sym *cursym;
-
- main() { ANTLR(prog(), stdin); }
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 39
-
- PCCTS 1.00
-
-
-
-
-
-
- zzcr_attr(attr, token, text)
- Attrib *attr;
- int token;
- char *text;
- {
- switch ( token )
- {
- case WORD :
- case BuiltInType :
- case PARAMETER :
- case VAR : attr->var = cursym;
- break;
- case INT : attr->val = atoi(text);
- break;
- }
- }
- >>
-
-
-
-
-
-
- #token WORD "[a-zA-Z_][a-zA-Z_]*"
- <<{ extern Sym *cursym;
-
- cursym = zzsymget(zzlextext);
- if ( cursym == NULL ) cursym = zzsymnew(zzlextext);
- else LA(1) = cursym->token;
- }>>
-
- prog : ... VAR ... ;
-
-
-
-
- When a WORD is matched on the input stream, the character string
- is looked up in the symbol table. If it exists, a pointer to it is
- placed in cursym and the current look-ahead token (LA(1)) is modified
- accordingly. Otherwise, a symbol table record is created leaving cur-
- sym pointing to it. Also of note, we cannot place the zzsymget() call
- into zzcr_attr() because of the look-ahead. zzcr_attr() is called
- after the token has been recognized; modifying LA(1) in zzcr_attr()
- would have no effect. Parsing decisions are made based upon current
- look-ahead and, therefore, the lexical action must "compute" the token
- value before the parser receives it.
-
- 4.1.2. Attribute destruction
-
- The user may elect to "destroy" all attributes created with
-
-
-
- Page 40
-
- PCCTS 1.00
-
-
- zzcr_attr(). A macro or function called zzd_attr(), is executed once
- for every attribute. This occurs collectively at the end of every
- block. zzd_attr() is passed the address of the attribute to destroy.
- This can be useful when memory is allocated with zzcr_attr() and needs
- to be free()'ed. For example, sometimes zzcr_attr() needs to make
- copies of some lexical objects temporarily. Rather than explicitly
- inserting code into the grammar to free these copies, zzd_attr() can
- be used to do it implicitly. Notice that this concept is similar to
- the constructors and destructors of C++ [Str87].
-
- Attributes are often character strings. Copies of the lexical
- text buffer (managed by DLG) are made which later need to be deallo-
- cated. This can be accomplished with code similar to the following.
-
-
-
-
- #header <<#define zzd_attr(attr) {free(*(attr));}
- typedef char *Attrib;>>
-
- <<
- zzcr_attr(attr, token, text)
- Attrib *attr;
- int token;
- char *text;
- {
- extern char *malloc();
-
- if ( token == StringLiteral )
- {
- *attr = malloc( strlen(text)+1 );
- strcpy(*attr, text);
- }
- }
- >>
-
-
-
- 4.1.3. Standard attribute definitions
-
- Some typical attribute types are defined in the PCCTS include
- directory. These standard attribute types are contained in the fol-
- lowing include files:
-
-
- charbuf.h
- Attributes are fixed-size buffers, each of 32 characters in
- length. If a string longer than 31 characters (31 + 1 '\0' ter-
- minator) is matched for a token, it is truncated to 31 charac-
- ters. The user can change this buffer length from the default 32
- by redefining ZZTEXTSIZE before the point where charbuf.h is
- included. The text for an attribute must be referenced as
- $i.text.
-
-
-
-
- Page 41
-
- PCCTS 1.00
-
-
- int.hAttributes are int values derived from tokens using the atoi()
- function.
-
- charptr.h, charptr.c
- Attributes are pointers to dynamically-allocated variable-length
- strings. Although generally both more efficient and more flexi-
- ble than charbuf.h, these attribute handlers use malloc() and
- free(), hence, they can cause memory fragmentation. The file
- charptr.c must be #include'd, or linked with, the C code PCCTS
- generates for any grammar using charptr.h.
-
-
- In addition to the standard library of attribute handlers, the user
- may define new attribute types and handlers - which should also be
- placed in the PCCTS include directory. The makefile template given in
- section 1.4 knows where the include files live and will inform the C
- compiler.
-
- The following example uses the charptr package:
-
-
-
-
- #header <<#include "charptr.h">>
-
- <<
- #include "charptr.c"
- main() { ANTLR(a(), stdin); }
- >>
-
- #token "\n" <<zzskip();>>
-
- a : "[a-z]+" <<printf("matched %s\n", $1);>>
- ;
-
-
-
- 4.1.4. Attribute references
-
- PCCTS accepts five basic attribute referencing notations:
-
- $i This format refers to the ith rule element within the current
- alternative which can be a rule reference, token or subrule.
-
- $i.j This format extends the $i notation to allow the user to refer to
- attributes at scopes above the current scope. i is the scope and
- j is the jth rule element within the alternative that encloses
- the current scope.
-
- $$ This is a shorthand for the $0 of the enclosing rule and can be
- referred to from an action anywhere within the rule.
-
- $label
- $rule is identical to $$ where rule is the name of the enclosing
-
-
-
- Page 42
-
- PCCTS 1.00
-
-
- rule.
-
- $parameter refers to one of the user-defined parameters given in
- the enclosing rule definition.
-
- $return_value refers to one of the user-defined return values
- given in the enclosing rule definition.
-
- $[token, text]
- This represents an attribute constructor. Its value is computed
- using zzcr_attr just like any other attribute with token and text
- as parameters. This is an explicit method of attribute creation
- whereas normally attributes are created implicitly - one for each
- input terminal. This mechanism can be useful when creating AST
- nodes which are functions of attributes or for passing values to
- subrules. For example, if attributes were integers and integer
- tokens were labeled NUM, $[NUM,"34"] would create an attribute
- whose value was 34. $[] returns an empty attribute node.
-
- 4.1.4.1. $i references
-
- As a simple example of $i numbering in rules with alternatives,
- consider the following.
-
-
-
-
- a: B | C ;
-
-
-
- Rule a has 2 alternatives. $i refers to the ith rule element in the
- current block and within the same alternative. So, in rule a, both B
- and C are $1.
-
- Below, rule a has a subrule as the second rule element and is
- referenced as $2 after the entire subrule has been matched. Assume
- that charptr.h attributes are in effect.
-
-
-
-
- a : "ick" ("A" <<$0 = $1;>> | "B" <<$0 = $1;>>) "ugh"
- <<printf("$1,$2,$3 are %s,%s,%s\n", $1,$2,$3);>>
- ;
-
-
-
- In this case, one of the following will be printed depending on
- whether "A" or "B" was found
-
-
-
-
-
-
-
- Page 43
-
- PCCTS 1.00
-
-
-
-
-
-
- $1,$2,$3 are ick,A,ugh
- $1,$2,$3 are ick,B,ugh
-
-
-
- $2 is the attribute of the subrule ("A"|"B") and is the second
- rule element of rule a (outermost level). However, within the
- subrule, "A" and "B" are both $1. The initial value of $0 will be
- undefined unless it either inherits a value or is assigned a value by
- defining the macro zzdef0. This value can, of course, be altered
- within the subrule by using $0 on the left side of an assignment.
-
- $$ always refers to the $0 of the outermost scope - the rule
- scope (i.e $1.0). $$ can be referenced from any user action within
- any subrule.
-
- 4.1.4.2. $i.j references
-
- The EBNF subrules of PCCTS introduce scoping as in programming
- languages. To access attributes at scopes above (in an enclosing
- scope) from within a user action, $i.j variables are used where i is
- the scope and j follows the normal attribute numbering. Levels are
- numbered from 1 starting at the rule level. Each subrule increments
- the level. Actions at level i cannot access attributes at level i+1
- or below just like in a programming language. Attributes of the form
- $i are accepted as a shorthand. When a level is not indicated, the
- current scope will be assumed. For example,
-
-
-
-
- a : A B (C (D E) F) G ;
-
-
-
- The following table summarizes levels and attribute numbers:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 44
-
- PCCTS 1.00
-
-
-
- ____________________________________________________
- ____________________________________________________|
- | A || $1.1 | $1 | rule a after A |
- |______||_____________|________|____________________|
- |______||_____________|________|____________________|
- | C || $2.1 | $1 | (C (D E) F) |
- |______||_____________|________|____________________|
- |______||_____________|________|____________________|
- | E || $3.2 | $2 | (D E) |
- |______||_____________|________|____________________|
- |______||_____________|________|____________________|
- | G || $1.4 | $4 | rule a after G |
- |______||_____________|________|____________________|
-
-
- In this example, actions at level 1 (rule level) cannot access attri-
- butes within any of the subrules. Actions before the subrules cannot
- access these lower attributes because they have not been created yet
- (the terminals have not been recognized). Actions following the
- subrules cannot access the lower attributes because they have all been
- collapsed into one attribute--namely, $3.
-
- 4.1.4.3. $label references
-
- Inheritance (parameters and return values) and rule-level $0
- attributes are referred to by name. $rule is identical to $$, but is
- more informative. The following two examples perform the same func-
- tion, but the first uses upward-inheritance and and the second uses
- $rule notation.
-
-
-
-
- #header <<#include "int.h">>
-
- <<main() { ANTLR(start(), stdin); } >>
-
- start : <<int r;>>
- expr > [r] "\n" <<printf("result %d\n", r);>>
- ;
-
- expr > [int result]
- : "[0-9]+" <<$result = $1;>>
- ( "\+" "[0-9]+" <<$result += $2;>> )*
- ;
-
-
-
- Note that the next version does not require a parameter definition
- since all rules implicitly return an attribute ($$, $rule).
-
-
-
-
-
-
- Page 45
-
- PCCTS 1.00
-
-
-
-
-
-
- #header <<#include "int.h">>
-
- <<main() { ANTLR(start(), stdin); } >>
-
- start : expr "\n" <<printf("result %d\n", $1);>>
- ;
-
- expr : "[0-9]+" <<$expr = $1;>>
- ( "\+" "[0-9]+" <<$expr += $2;>> )*
- ;
-
-
-
- Similarly, $parameter refers to one of the user-defined parameters.
-
- 4.1.5. $[token, text] attribute constructor
-
- This attribute is not automatically destroyed via zzd_attr if
- zzd_attr is defined. If necessary, the user must explicitly destroy
- these attributes. For example,
-
-
-
-
- #header <<#include "charptr.h">>
-
- <<
- #include "charptr.c"
- main() {ANTLR(a(), stdin);}
- >>
-
- a : <<Attrib b = $[0, "a string"];>>
- <<printf("b is %s\n", b);
- zzd_attr(&b); >>
- ;
-
-
-
- which would yield
-
-
-
-
- b is a string
-
-
-
- Note that the token number is irrelevant for charptr type attributes
- since its zzcr_attr is a function of the text only.
-
-
-
-
- Page 46
-
- PCCTS 1.00
-
-
- 4.1.6. Inheritance
-
- Inheritance refers to the ability of a rule to obtain information
- about the context in which it was invoked. Because ANTLR generates
- top-down parsers, information can be carried down through each rule
- invocation as well as returned. This ability, dubbed downward-
- inheritance, provides significant power because a rule can decide
- which rule invoked it or gain some other context altering information.
- LR parser-generators create parsers of the bottom-up variety and will
- never be able to pass information to a subrule because recognition
- begins at the leaves of the parse-tree and works it way upwards to the
- root (start symbol).
-
- 4.1.6.1. $0 inheritance
-
- $0 is one mechanism by which attributes can be inherited from and
- returned to an invoking block. The inheritance mechanism uses the
- attribute stack to initialize the $0 value seen by a subrule.
- Subrules can only pass one parameter which is implicitly typed as an
- attribute ($0). The $0 of a subrule is undefined unless set by a user
- action or by passing an attribute to that subrule. For example,
-
-
-
-
- decl : TypeID
- Var <<$2.type = $1;>>
- ( "," Var <<$2.type = $0;>> )*[$1]
- ;
-
-
-
- where the [ ] is an InheritAction mentioned above in the grammar for a
- PCCTS rule. The value inside is passed to the $0 of the subrule.
- This syntax is also used to pass parameters to rules. See below for
- more details. In this case, we are passing the type of a declaration
- (represented by the attribute associated with TypeID - $1 in the
- outermost scope) to a subrule which accepts a list of variables
- separated by commas. This could alternatively be accomplished via
-
-
-
-
- decl : TypeID
- Var <<$2.type = $1;>>
- ( "," Var <<$2.type = $1.1;>> )*
- ;
-
-
-
- or with a local variable.
-
-
-
-
-
-
- Page 47
-
- PCCTS 1.00
-
-
-
-
-
-
- decl : <<Attrib type;>>
- TypeID <<type = $1;>>
- Var <<$2.type = type;>>
- ( "," Var <<$2.type = type;>> )*
- ;
-
-
-
- Unlike subrules, rules employ the C hardware stack and $0 is
- never automatically set by inheritance. Despite this, the inheritance
- of a value for a rule's $0 can be accomplished by:
-
-
-
-
- rule[Attrib dummy]: <<$0=$dummy;>> remainder_of_rule ;
-
-
-
- This uses the more powerful rule communication mechanism and an
- assignment to $0 to simulate the inheritance mechanism used for
- subrules. Rules have a much more flexible inheritance scheme which,
- combined with rule-local variables for subrule inheritance, make the
- YACC-like attributes (Attrib's) useful primarily for communicating
- with the lexical analyzer.
-
- If a macro called zzdef0 is #define'd, it is executed upon entry
- to any block (rule or subrule) via zzdef0(&($0)). This macro can be
- used to ensure that all attributes have a certain property. For exam-
- ple, if $0 is not set in a rule, then the $i in the invoking rule will
- be undefined (some previous attribute typically). Setting the $0 ini-
- tially to NULL or some other meaningful value can be convenient. If
- attributes are string pointers, not assigning $0 to NULL before rule
- completion could cause an invoking rule to assume a valid pointer had
- been returned (as $i).
-
- Note that use of zzdef0 is incompatible with the use of $0 in
- subrules to inherit values, i.e., zzdef0 will destroy the inherited
- value.
-
- 4.1.6.2. Generalized rule communication
-
- PCCTS rules may pass parameters and return values just as C func-
- tions do (though, PCCTS rules can return multiple values). This
- mechanism can be viewed as generalized attribute handling where each
- rule defines its downward- and upward-inheritance attribute types.
-
- 4.1.6.2.1. Downward-inheritance
-
- The syntax for PCCTS parameter passing (downward-inheritance) is
-
-
-
- Page 48
-
- PCCTS 1.00
-
-
- identical to that of the ANSI C Standard [ANS90] except that the ( )
- is replaced by [ ]. Specifically, parameter declarations are comma-
- separated lists of single declarations:
-
-
-
-
- rule[type var , ..., type var ] : ... ;
- 1 1 n n
-
-
- The parameter declarations are broken down into old style C declara-
- tions when parsers are generated:
-
-
-
-
- rule(var , ..., var )
- type va1 ; n
- 1..; 1
- type var ;
- { n n
- /* Stuff to recognize rule */
- }
-
-
-
- which is recognized by both ANSI compliant and old-style C compilers.
-
- The parameters mentioned in the parameter declaration list may be
- used in user actions like any other variable just as in C except that
- they must be preceded by a $. The $ is consistent with references to
- the attributes associated with terminals and rule references. For
- example,
-
-
-
-
- a[int i, int j] : <<printf("%d\n", $i+$j);>> ;
-
-
-
- adds the two inherited attributes i and j and prints the result to
- stdout. It can be invoked in the following manner:
-
-
-
-
- rule : a[3,4]
- ;
-
-
-
-
-
-
-
- Page 49
-
- PCCTS 1.00
-
-
- 4.1.6.2.2. Upward-inheritance
-
- Return values are similarly declared via a comma-separated list
- of types and names enclosed in [ ] (following the parameter definition
- if any):
-
-
-
-
- rule[parameters] > [type var , ..., type var ] : ... ;
- 1 1 n n
-
-
- The names given in the return value definition may be used to expli-
- citly set the return values. These may be treated like $-prefixed
- variables and can referenced as well as set. The names in the return
- value definition may not conflict with any parameter or local variable
- defined elsewhere in that rule.
-
-
-
-
- a[int i, int j] > [int x, int y] : <<$x = $i+$j; $y = 0;>> ;
-
-
-
- Rule a can be invoked in the following manner:
-
-
-
-
- rule : <<int i,j;>>
- a[3,4] > [i,j]
- ;
-
-
-
- which passes 3 and 4 to rule a which returns 7 and 0 placing them into
- i and j respectively.
-
- 4.2. Abstract-syntax-tree construction and manipulation
-
- Many translation problems can be accomplished with actions embed-
- ded in a grammar that simply compute an output phrase and send it to
- the output stream. Unfortunately, most translation problems cannot be
- implemented in such a straightforward manner. For this reason, PCCTS
- supports an abstract-syntax-tree (AST) mechanism that provides a
- framework for constructing and manipulating intermediate representa-
- tions of the input stream.
-
- Without touching the grammar description, the user may direct
- PCCTS (by using the -gt command-line option) to automatically generate
- an AST of the input stream. By default, this AST is a flat tree (a
- list) with one node for each input token. Generally, the organization
-
-
-
- Page 50
-
- PCCTS 1.00
-
-
- of an AST reflects the language structure used to recognize an input
- phrase. A flat AST does not contain information about the language
- structure and, in effect, would only be a copy of the input. The
- default automatic tree construction can be modified by annotating the
- grammar with a few appropriately placed characters and is sufficient
- to handle many situations (such as arithmetic expressions) well.
-
- Some intermediate-forms cannot be adequately described by simple
- grammar annotations (e.g. type-trees for the C programming language).
- A more general tree construction mechanism is provided by PCCTS to
- allow explicit intermediate-form tree description and is described in
- section 4.2.4.
-
- AST nodes and subtrees, both automatically and explicitly con-
- structed, are referenced via #-variables. The notation and behavior
- of these variables is similar to that of the attribute $-variables.
- They are always implicitly typed as pointers to AST nodes and are gen-
- erally associated with rule and token references; rules yield subtrees
- and tokens yield nodes.
-
- PCCTS parsers manage the actual creation of nodes and the con-
- struction of the trees, but the user must describe what information an
- AST node is to carry and how to fill in that information.
-
- The command-line option -gt must be used to access any of the
- features is this section. Also note that the PCCTS automatically
- defines #define GENAST so that the user may write code that depends on
- whether or not trees are being constructed.
-
- 4.2.1. AST node creation
-
- PCCTS provides a default and an explicit node constructor. The
- default constructor is applied for each token in a rule defined
- without the ! modifier (which indicates that the rule should disable
- automatic tree construction) and is a function of the associated
- attribute, token and lexical text. The explicit constructor is user
- defined and can take a variable number of arguments.
-
- A macro called zzcr_ast() can be defined to give a default node
- definition or transformation. Also, user actions embedded in the
- grammar can modify AST fields. AST nodes have the structure
-
-
-
-
- struct _ast {
- struct _ast *right, *down;
- user_defined_fields
- };
-
-
-
- where user_defined_fields is filled in via the #define constant:
- AST_FIELDS. Only the user-defined fields should be modified by the
-
-
-
- Page 51
-
- PCCTS 1.00
-
-
- user as right and down are handled by the code inserted by PCCTS and
- could be subject to change. zzcr_ast() is of the form
-
-
-
-
- #define zzcr_ast(ast,attr,tok,text) *ast = F(attr, tok, text);
-
-
-
- with
-
-
-
-
- AST *ast;
- Attrib *attr;
- int tok;
- char *text;
-
-
-
- For example, a simple, but useful attribute and AST definition is
-
-
-
-
- #header <<
- typedef int Attrib;
- #define AST_FIELDS int i;
- #define zzcr_attr(attr, token, text) *(attr)=atoi(text)
- #define zzcr_ast(ast, attr, tok, text) (ast)->i = *(attr)
- >>
-
-
-
- which defines both attributes and AST node data to be integers. This
- default AST node construction merely copies the integer found on the
- input stream into the data field i of the new AST node.
-
- AST nodes may be explicitly constructed via a user-defined func-
- tion called zzmk_ast() which is defined as follows:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 52
-
- PCCTS 1.00
-
-
-
-
-
- #include <varargs.h>
-
- /* fill in an AST node (which is passed in) and return a ptr to it */
- AST *zzmk_ast(va_alist)
- va_dcl
- {
- va_list ap;
- AST *node;
-
- va_start(ap);
- node = va_arg(ap, AST *); /* 1st arg is always a new AST node */
- ...
- /* fill in node */
-
- va_end(ap);
- return node;
- }
-
-
-
- Note that the K&R-style variable argument method is not required if
- you are not interested in portability or a constant number of parame-
- ters is passed (if only one node type is used, for example). A port-
- able ANSI C version would look like the following:
-
-
-
-
- #include <stdarg.h>
-
- /* fill in an AST node (which is passed in) and return a ptr to it */
- AST *zzmk_ast(AST *node, ...)
- {
- va_list ap;
-
- va_start(ap, node);
- ...
- /* fill in node */
-
- va_end(ap);
- return node;
- }
-
-
-
- To make an explicit node constructor that duplicates the simple
- integer AST node definition given above, the following would suffice
- (ANSI C version):
-
-
-
-
-
-
- Page 53
-
- PCCTS 1.00
-
-
-
-
-
- #header <<
- typedef int Attrib;
- #define AST_FIELDS int i;
- #define zzcr_attr(attr, token, text) *(attr)=atoi(text)
- #define zzcr_ast(ast, attr, tok, text) (ast)->i = *(attr)
- #include <stdarg.h>
- >>
-
- AST *zzmk_ast(AST *node, ...)
- {
- va_list ap;
-
- va_start(ap, node);
- node->i = va_arg(ap, int);
- va_end(ap);
- return node;
- }
-
-
-
- Now, an action may construct tree nodes via #[integer]. For example,
-
-
-
-
- a > [AST *node]
- : <<$node = #[34];>> /* return a node with 34 in it */
- ;
-
-
-
-
- The #[ ] node constructor is detailed in section 4.2.6.
-
- 4.2.2. AST node destruction
-
- A routine called zzfree_ast(AST *t) is used to free any
- abstract-syntax-tree created by an PCCTS parser. If defined, a macro
- called zzd_ast is applied to each AST node before its memory is
- free()'d. This can be used to free any additional memory that may
- have been allocated by zzcr_ast and attached to the AST nodes. The
- zzd_ast macro is passed the address of the node it is to destroy and
- is of the form
-
-
-
-
- #define zzd_ast( treeNodePtr ) /* perform operation on tree node */
-
-
-
-
-
-
- Page 54
-
- PCCTS 1.00
-
-
- zzd_ast must be defined in or included into the #header action.
- zzfree_ast() must be called explicitly by the user.
-
- 4.2.3. Automatic AST construction
-
- For small or uniform grammars (like C language expressions),
- PCCTS provides a concise tree construction mechanism. Token refer-
- ences are annotated to indicate
-
- o which tokens are to be subtree roots in the AST
-
- o which tokens are to be excluded from the AST
-
- Any non-modified (non-root and non-excluded) token is considered a
- leaf node. Rule names are never included in the automatically gen-
- erated AST's. The resulting tree is composed entirely of nodes
- derived from grammar terminals. Each rule yields one subtree.
- Subrules do not create subrules; elements enclosed within subrules
- modify the subtree for the current rule. Subtrees and nodes created
- from terminal references are accessed from user actions via #-
- variables which are discussed in section 4.2.6.
-
- 4.2.3.1. Grammar annotation
-
- PCCTS generates an AST by placing C code into the parser that, at
- parser run-time, builds a tree and makes it available to the user's
- program. AST's are stored in a child-sibling list form similar to
- that used by LISP. This allows each level in the AST to have one root
- and arbitrarily many children:
-
-
-
-
- (root child child ... child )
- 1 2 n
-
-
- Suffixing a token with ^ identifies it as a subtree root. Each
- root-token encountered increases the depth of the current subtree by
- 1. All further leaf tokens are considered siblings to the previous
- root node. A ! suffix on a terminal reference excludes the terminal
- from the AST (no AST node is created or linked in). Suffixing a rule
- reference with a ! indicates that the subtree created by that rule
- reference is not to be linked into the current subtree, but is still
- to be made available via #i variables. Rule references yield a
- pointer to the root of the subtree created by that rule invocation.
- Therefore, the subtree roots returned by rules become siblings in the
- current rule. Rules defined with a ! immediately following the rule
- name (and before the :) do not automatically construct trees. A ! in
- the rule definition is like placing a ! after each grammar element
- within that rule. #-variables do not exist for tokens suffixed with
- !. Hence, tree nodes are not automatically constructed within these
- rules although references to other rules can still create subtrees
- that can be accessed via #-variables.
-
-
-
- Page 55
-
- PCCTS 1.00
-
-
- To demonstrate the actual construction of AST's, consider the follow-
- ing grammar fragment.
-
-
-
-
- x : A B^ C! D y
- ;
-
- y : E^ F
- ;
-
-
-
- The grammar matches
-
-
-
-
- A B C D E F
-
-
-
- and considers B to be a subtree root. Token C is to be ignored. Rule
- y matches E F and returns a subtree of depth 2 to rule x where the
- root of rule y's subtree becomes available as #5. The following fig-
- ure sequence shows the state of the AST after each token has been
- recognized. The various AST nodes are labeled with the #-variable
- that could be used to access that node. The LISP-like child-sibling
- notation is given after the input state.
-
-
- [Figure omitted]
-
-
- A o B C D E F, ( A )
-
-
-
- [Figure omitted]
-
-
- A B o C D E F, ( B A )
-
-
-
- [Figure omitted]
-
-
- A B C o D E F, ( B A )
-
-
-
- [Figure omitted]
-
-
-
- Page 56
-
- PCCTS 1.00
-
-
- A B C D o E F, ( B A D )
-
-
-
- [Figure omitted]
-
-
- A B C D E F o, ( B A D ( E F ) )
-
-
- Because of the ! suffix, terminal C does not appear in the AST. Ter-
- minal B is a root-token because of the ^ modifier and is therefore the
- root of the subtree created for rule x. The root of the subtree
- created by rule y is a sibling of the leaf nodes of rule x. The way
- in which rules describe a grammar effects AST construction. e.g.
- rules x and y could be rewritten to recognize the same input, but to
- construct a different tree.
-
-
-
-
- x : A B^ C! D E^ y
- ;
-
- y : F
- ;
-
-
-
- In this case the root-token E will become the root of the tree matched
- up to that point (i.e. ( B A D )) yielding
-
-
-
-
- ( E ( B A D ) )
-
-
-
- The final AST would be
-
-
-
-
- ( E ( B A D ) F )
-
-
-
- 4.2.3.2. Operator precedence
-
- Operator associativity in the AST's is directly related to the
- precedence implicitly defined among the rules. For example, to specify
- that the exponentiation operator, **, groups right-to-left, a grammar
- similar to the following would be required.
-
-
-
- Page 57
-
- PCCTS 1.00
-
-
-
-
-
-
- expr : exp0 { "\*\*"^ expr }
- ;
- exp0 : "[0-9]+"
- ;
-
-
-
- The input
-
-
-
-
- 3**4**5
-
-
-
- would yield
-
-
-
-
- ( \*\* 3 ( \*\* 4 5 ) )
-
-
-
- indicating that the 4**5 would be performed first.
-
- For a full explanation of arbitrary tree construction see the
- section on explicit AST construction. However, the following example
- is beneficial in this section. Give a grammar fragment that recog-
- nizes a simple assignment, a tree of the form: ( := expr lvalue ) can
- be constructed via:
-
-
-
-
- assign! : lvalue ":=" expr ";" <<#0 = #(#[$2], #3, #1);>>
- ;
-
-
-
- The ! after the rule definition tells PCCTS to turn off automatic tree
- construction for this rule. #[$2] is an AST node constructed from an
- attribute.
-
- 4.2.3.3. Example
-
- The following simple grammar illustrates the automatic construc-
- tion of AST's, the definition/creation of AST nodes, and the use of #i
- variables
-
-
-
- Page 58
-
- PCCTS 1.00
-
-
-
-
-
-
- #header <<
- typedef int Attrib;
- #define AST_FIELDS int i, token;
- #define zzcr_attr(attr, token, text) *(attr)=atoi(text)
- #define zzcr_ast(ast, attr, tok, text) {(ast)->i = *(attr); (ast)->token=0;}
- >>
-
-
-
-
-
-
-
- <<
- AST *root = NULL; /* define a root ptr for AST; must be NULL */
-
- void show(tree) /* define function to show a tree node */
- AST *tree;
- {
- if ( tree->token != 0 ) printf(" %s", zztokens[tree->token]);
- else printf(" %d", tree->i);
- }
- void before() { printf(" ("); }
- void after() { printf(" )"); }
-
- main()
- {
- ANTLR(expr(&root), stdin); /* get an expr AST by passing &root */
- zzpre_ast(root, show, before, after);/* print out in LISP notation */
- }
- >>
-
-
-
-
-
-
-
- expr : exp0 ( "\+"^ <<#1->token=LA(1);>> exp0 )* "\n"! ;
- exp0 : exp1 ( "\*"^ <<#1->token=LA(1);>> exp1 )* ;
- exp1 : "[0-9]+"
- ;
-
-
-
- The actions in rules expr and exp0 set the token field (defined in
- AST_FIELDS) to the current token of look-ahead (LA(1)). This tells
- the predefined astpreorder() function to print the token value and not
- the number value (field i). Note that we must pass the address of a
- root pointer to expr so that expr can make it point to the subtree
-
-
-
- Page 59
-
- PCCTS 1.00
-
-
- constructed for that rule. The newline character is not included in
- the trees because of the ! suffix. The + and the * characters are
- considered operators in our language and so they are modified by ^.
- The job performed by the default AST node creation macro zzcr_ast
- could have just as easily been accomplished by removing zzcr_ast and
- changing rule exp1 to be
-
-
-
-
- exp1 : "[0-9]+" <<#1->i = $1; #1->token = 0;>>
- ;
-
-
-
- which is somewhat more explicit, but would be inconvenient if many
- such actions were required.
-
- If the input to this PCCTS parser example were 3+4, the program
- would print out
-
-
-
-
- ( \+ 3 4 )
-
-
-
- which represents
-
-
- [Figure omitted]
-
-
- or, in child-sibling form
-
-
- [Figure omitted]
-
-
- Note that the AST is significantly different than the parse-tree which
- can be represented by
-
-
- [Figure omitted]
-
-
- An input of 3+4*5 yields
-
-
-
-
- ( \+ 3 ( \* 4 5 ) )
-
-
-
-
- Page 60
-
- PCCTS 1.00
-
-
- which represents
-
-
- [Figure omitted]
-
-
- or, in child-sibling form
-
-
- [Figure omitted]
-
-
- The 4*5 would be performed before the addition because one of the
- addition's operands is itself an operation (i.e. a multiply).
-
- 4.2.4. Explicit AST construction
-
- The automatic tree construction scheme available with PCCTS is
- useful for building AST's whose structure is similar to that of the
- grammar parse trees. It is not suited to applications where the
- structure of the grammar bears no resemblance to the structure
- required in the intermediate-form. C declarations are a good example
- of this and a sample "C declaration to English" translator is
- developed in this section.
-
- Reasonable tree-construction and tree-rewriting systems, such as
- Purtilo and Callahan's [PuC89] NewYacc, have been developed but most
- rely on bottom-up parser generators like YACC to build a parse-tree in
- memory which is later traversed to simulate inherited attributes.
- PCCTS parsers construct AST's while parsing the input. Rules can pass
- subtrees as arguments to other rules and construct AST's without first
- having to build a parse-tree. PCCTS supports general AST construction
- via two simple tree description constructs - AST node constructors and
- AST tree definitions.
-
- The AST node constructor is similar to the constructor for attri-
- butes ($[token,text]) except that AST node constructors yield pointers
- to AST nodes whereas attribute constructors yield attributes (of type
- Attrib). #[args] creates an AST node and passes a pointer to it to
- zzmk_ast() with args as arguments. Constructors may be nested to
- create AST nodes from attributes; e.g. #[ $[token,text] ].
-
- Trees are defined using a LISP-like notation because of its brev-
- ity.
-
-
-
-
- #( root, child , child , ..., child )
- 1 2 n
-
-
- where root and child are pointers to AST nodes which can themselves
- be trees or node iconstructors. The return tree for any rule is #0
-
-
-
- Page 61
-
- PCCTS 1.00
-
-
- which can appear on both the left- and right-hand side of an assign-
- ment statement. Automatic construction of trees will proceed unless
- the ! is included in the rule definition to override the normal AST
- mechanism. The same grammar may have some rules that use the
- automatic tree mechanism and other rules that explicitly construct
- trees. However, the methods must not be mixed within the same rule.
- In other words, do not assign a tree to #0 unless you have turned off
- the automatic tree construction. When default AST construction is
- turned off, #-variables do not exist for terminals or rule references
- except for references to rules that construct trees (explicitly or
- automatically). Nodes for terminals must be explicitly created just
- as the tree structure must be explicitly defined.
-
- 4.2.4.1. Simple example
-
- We present a trivial example in this section to illustrate the
- AST node and tree constructors.
-
-
-
-
- #header <<
- #include "int.h"
- #define AST_FIELDS int i;
- >>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 62
-
- PCCTS 1.00
-
-
-
-
-
-
- <<
- AST *root = NULL; /* define a root ptr for AST; must be NULL */
-
- void show(tree) /* define function to show a tree node */
- AST *tree;
- {
- printf(" %d", tree->i);
- }
- void before() { printf(" ("); }
- void after() { printf(" )"); }
-
- AST *zzmk_ast(node, v)
- AST *node;
- int v;
- {
- node->i = v;
- return node;
- }
-
- main()
- {
- ANTLR(tree(&root), stdin);
- zzpre_ast(root, show, before, after);
- }
- >>
-
-
-
-
-
-
- tree! : NUM NUM NUM "\n"
- <<#0 = #( #[$[NUM,"101"]], #[$1], #[$2], #[$3] );>>
- ;
-
- #token "\ " <<zzskip();>>
- #token NUM "[0-9]+"
-
-
-
- Given the input 3 4 5, rule tree would return
-
-
-
-
- ( 101 3 4 5 )
-
-
-
- The rule itself matches three integers followed by a newline. As
-
-
-
- Page 63
-
- PCCTS 1.00
-
-
- usual, the four attributes $1-$4 exist, but #1-#4 do not because rule
- tree was defined with a ! modifier. To create AST nodes for the ter-
- minals we use the node construct #[$i] which returns a new node con-
- structed through zzmk_ast. The newline is not to be included in the
- tree so no AST node was constructed for it. To illustrate the attri-
- bute constructor, a root node was created as a function of an attri-
- bute with a value of 101, though #[101] would have been easier. There
- is no corresponding terminal in the grammar for this AST node and
- therefore one was created. $[NUM,"101"] creates an attribute through
- zzcr_attr which is passed to zzmk_ast when enclosed in #[]. The sub-
- tree created in rule tree is explicitly returned by making #0 point to
- the root of the subtree.
-
- 4.2.4.2. C declaration to English translator
-
- The type declaration syntax of C is somewhat bizarre and can be
- troublesome to even experienced programmers. The syntax of declara-
- tion mirrors that of use. In other words, variables are defined the
- way they would be used in an expression. We give a loose description
- of how to correctly "parse" a C declaration and then we present a com-
- plete translator to describe, in English, what the declaration means.
-
- Kernighan and Ritchie [KeR88] give an excellent description of
- how to understand C declarations and they present a program to convert
- from C to a word description. To parse a declaration as a Human, we
- follow a few simple rules.
-
- (i) Start parsing at the variable name.
-
- (ii) Scan to the right. For each [expr] or () encountered say "n-
- element array of" or "function returning" respectively where n is
- the (optional) expression found inside the brackets. Continue
- scanning until a right-parenthesis or a semicolon.
-
- (iii)Now, scan left. For each * say "pointer to" until you see a type
- name or a left-parenthesis.
-
- (iv) Move to the next outer nesting level if one exists and begin
- again at rule (ii). If no more declaration symbols exist, look
- to the left and say the type name (ignoring storage classes
- here).
-
- For example, let us parse
-
-
-
-
- int *(*a[10])();
-
-
-
- By (i), we begin at a. Looking to the right in (ii), we see brackets
- and so we say "10-element array of." The right-parenthesis makes us
- jump to rule (iii) and scan left. We see a * and say "pointer to."
-
-
-
- Page 64
-
- PCCTS 1.00
-
-
- The left-parenthesis finishes off (iii) and rule (iv) tells us to exit
- this level of parenthesis (nesting) and start again at (ii). Scanning
- to the right we see () and say "function returning." The semicolon
- forces us to (iii) where we see a * and say "pointer to." (iv) tells
- us to say "integer." Packing all of the "sayings" together yields:
- "10-element array of pointer to function returning pointer to integer"
- or, in a flat tree representation,
-
-
-
-
- ( [10] ( * ( () ( * int ) ) ) )."
-
-
-
- Our example parser will construct trees of a similar form but with
- result in a little better English.
-
- To build a type tree for C we "unroll" the nesting of the
- declaration so that it can be read left-to-right. Our trees will have
- the variable name at the root and the type as the leaf. Each node in
- the tree will be a type-qualifier. The type-qualifiers are one of
- [expr], (), * and are linked together in a parent/child relationship.
- e.g. ( i ( * int ) ) means that i is a pointer to an integer.
-
- The translator uses simple character buffers as attributes and
- AST node data. The trees are constructed at parse-time and later
- traversed to print out the description. A function to find the bottom
- of a parent/child list is required to augment the normal tree con-
- struction features of PCCTS.
-
-
-
-
- #header <<
- #include "charbuf.h"
- #define AST_FIELDS Attrib attr;
- >>
-
- <<
- AST *root = NULL;
-
- AST *zzmk_ast(node,attr)
- AST *node;
- Attrib attr;
- {
- node->attr = attr;
- return node;
- }
-
- main() { ANTLR(start(&root), stdin); }
-
-
-
-
-
-
- Page 65
-
- PCCTS 1.00
-
-
-
-
-
-
- AST *bottom(t)
- AST *t;
- {
- if ( t==NULL ) return NULL;
- if ( t->down != NULL ) return bottom(t->down);
- return t;
- }
-
- english(tree)
- AST *tree;
- {
- printf("%s is", tree->attr.text);
- engl(tree->down, 0);
- }
-
-
-
-
-
-
- engl(tree, plural)
- AST *tree;
- int plural;
- {
- if ( tree == NULL ) return;
- if ( strcmp(tree->attr.text, "*") == 0 )
- printf(plural?" pointers to":" a pointer to");
- else if ( strcmp(tree->attr.text, "nodim") == 0 ) {
- printf(plural?" arrays of":" an array of");
- plural = 1;
- }
- else if ( strcmp(tree->attr.text, "int") == 0 )
- printf(plural?" integers\n":" an integer\n");
- else if ( strcmp(tree->attr.text, "ret") == 0 )
- printf(plural?" functions returning": " a function returning");
- else {
- if ( plural ) printf(" %s-element arrays of", tree->attr.text);
- else printf(" a %s-element array of", tree->attr.text);
- plural = 1;
- }
- engl(zzchild(tree), plural);
- engl(zzsibling(tree), plural);
- }
- >>
-
-
-
-
-
-
-
-
-
- Page 66
-
- PCCTS 1.00
-
-
-
-
-
-
- #token "[\ \t\n]+" <<zzskip();>>
-
- start! : ( d <<english(#1); zzfree_ast(#1);>> )*
- ;
-
- d! : <<AST *word;>>
- "int" d1[#[$1]] > [word] ";" <<#0 = #(word, #2);>>
- ;
-
- d1![AST *type] > [AST *word]
- : "\*" d1[#(#[$1], $type)] > [$word] <<#0 = #2;>>
- | d2[$type] > [$word] <<#0 = #1;>>
- ;
-
- d2![AST *type] > [AST *word]
- : WORD d3[$type] <<$word = #[$1]; #0 = #2;>>
- | "\(" d1[NULL] > [$word] "\)" d3[$type]
- <<#( bottom(#2), #4 ); #0 = (#2==NULL)? #4 : #2;>>
- ;
-
- d3![AST *type]
- : "\[" NUM "\]" d3[type] <<#0 = #( #[$2], #4 );>>
- | "\[" "\]" d3[type] <<#0 = #( #[$[WORD,"nodim"]], #3 );>>
- | "\(" "\)" <<#0 = #( #[$[WORD,"ret"]], type );>>
- | <<#0 = type;>>
- ;
-
- #token NUM "[0-9]+"
- #token WORD "[a-z]+"
-
-
-
- The rules of this grammar perform the following functions
-
- startMatch multiple declarations translating each one to English and
- freeing each tree.
-
- d Only integers are recognized for simplicity. Declarations begin
- with "int" and are terminated by a ";". Rule d1 returns a
- pointer to the AST containing the symbol name found in the
- declaration (which has not yet been added to the tree). Rule d
- returns the declaration found in rule d1 with the symbol name as
- the new root. The type node is always the leaf and is passed to
- rule d1 which builds everything on top of the type node.
-
- d1 The * is the lowest priority symbol in a declaration and hence we
- "look" to the right first. The * is appended to whatever is
- found to the right. This is accomplished by passing the * and
- the currently built tree to rule d2. Since we are scanning from
- the left, but must translate as if starting from the most deeply
-
-
-
- Page 67
-
- PCCTS 1.00
-
-
- nested structure, we must keep adding type-qualifiers on top of
- the root. The current tree is always passed down. Rule d1
- returns a pointer to the AST node containing the variable name.
-
- d2 Recognize a variable name followed by an array/function descrip-
- tor or recognize a parenthesized declarator. Rule d2 creates an
- AST node for the variable name and returns it. The tree con-
- structed by rule d3 is also returned. We need to attach the
- array/function descriptor to the bottom of the parenthesized
- declarator which could be arbitrarily large. For this reason, we
- use the bottom() function to find the end of the declarator
- returned by the reference to d1.
-
- d3 Match an array or function descriptor. Note that two tokens of
- look-ahead are required to decide between alternatives one and
- two since both begin with a left-bracket. Tail-recursion is used
- to get a sequence of [] or () descriptors. Array descriptors
- like [34] are converted to tree nodes containing just the number
- of elements. If no size is present, "nodim" is substituted for
- the number in the AST node. If nothing is seen by rule d3, it
- returns the type that was passed in since nothing was appended.
- Function descriptors form type-qualifiers labeled "ret".
-
- It is interesting to note that the grammar actions required to con-
- struct the type trees were small and mostly assignments. The explicit
- tree constructors allow complicated trees to be built without a great
- deal of C code.
-
- The translator can be made using the following makefile.
-
-
-
-
- # Makefile for C -> English translator
- CFLAGS= -I../h
- AFLAGS= -gt -k 2
- GRM=decl
- SRC=scan.c $(GRM).c err.c
- OBJ=scan.o $(GRM).o err.o
-
- test: $(OBJ) $(SRC)
- cc -o $(GRM) $(OBJ)
-
- $(GRM).c parser.dlg : $(GRM).g
- antlr $(AFLAGS) $(GRM).g
-
- scan.c : parser.dlg
- dlg -C2 parser.dlg scan.c
-
-
-
- Our translator knows how and when to make things plural so it gen-
- erates correct English. For example,
-
-
-
-
- Page 68
-
- PCCTS 1.00
-
-
-
-
-
-
- int *(*i[5])();
-
-
-
- yields
-
-
-
-
- i is a 5-element array of pointers to functions returning pointers to integers
-
-
-
- 4.2.5. Tree manipulation routines
-
- PCCTS generates code that calls the following set of tree manipu-
- lation routines to create AST's. The user may modify the routines to
- suit their own purposes. The routines and AST node definitions are
- located in ast.c and ast.h respectively and reside in the PCCTS
- include directory. They are automatically included when the ANTLR -gt
- command-line option is specified. Routines prefixed with underscore
- are not normally called by the user.
-
- void _link(_root, _sibling, _tail)
- Called to ensure tree manipulation variables for current rule are
- valid after a rule reference.
-
- AST *_astnew()
- Create a new AST node and return it.
-
- void _child(_root, _sibling, _tail)
- Add a leaf node to the current AST. Creates an AST node, applies
- the default node transformation (zzcr_ast) if it exists, and
- appends the node to the current sibling list. After this func-
- tion, #i variables are available for user actions.
-
- void _subroot(_root, _sibling, _tail)
- Perform the same operations as _child(), but make the newly-
- created node the root for the current sibling list. If a root
- node exists, make the newly-created node the root of the current
- root.
-
- void zzpre_ast(tree, apply, before, after)
- Perform a preorder traversal of tree applying apply to each node.
- Apply function before before the start of each new subtree and
- function after after you complete each subtree.
-
- A common use of this function is to print out a tree. For exam-
- ple, if AST nodes had a field called token, the tree of tokens
- could be printed by defining the following
-
-
-
- Page 69
-
- PCCTS 1.00
-
-
-
-
-
-
- void show(tree)
- AST *tree;
- {
- if ( tree == NULL ) return;
- printf(" %s", zztokens[tree->token]);
- }
-
- void before() { printf(" ("); }
- void after() { printf(" )"); }
-
-
-
- and then calling zzpre_ast() as
-
-
-
-
- zzpre_ast(some_tree, show, before, after);
-
-
-
- AST *zzdup_ast(tree)
- Return a duplicate of the tree passed in.
-
- void zzfree_ast(tree)
- Free the AST pointed to by tree applying zzd_ast to each node
- before freeing if it is defined.
-
- zzrm_ast
- This macro is used to free the subtree for the current rule
- (using zzfree_ast()) and remove all references to it. PCCTS
- maintains local pointers to the current sibling list etc... which
- must be NULL'ed in order to begin a new subtree. Translators
- that perform an operation on a list of items using ()* or ()+ can
- add an action to free and remove references to the AST associated
- with the different items upon each iteration of the loop. e.g.
- (item <<zzrm_ast;>>)+.
-
- zzchild(tree)
- Return the first child of the node pointed to by tree. Returns
- NULL if tree is NULL.
-
- zzsibling(tree)
- Return the immediate sibling of the node pointed to by tree.
- Returns NULL if tree is NULL.
-
-
- The parameters tree, _root, _sibling and _tail are defined as
-
-
-
-
-
- Page 70
-
- PCCTS 1.00
-
-
-
-
-
-
- AST *tree, **_root, **_sibling, **_tail;
-
-
-
- A useful modification to _astnew() would be to use custom allocation
- of AST nodes. Since they are all the same size, it could be done
- quickly and efficiently.
-
- 4.2.6. AST references
-
- References to AST objects come in three basic forms.
-
- #i This attribute refers to the tree constructed for the ith rule
- element within the current alternative. #i variables are
- scoped/numbered just like $i variables and their association dis-
- solves after leaving the current attribute scope. #i is unique
- until the current block is exited or another iteration of a ( )*
- or a ( )+ loop has begun.
-
- #0 is special and always refers to the root of the subtree asso-
- ciated with the rule in which it was referenced. #0 is scoped
- like $$, not like $0; there is only one #0 for the entire rule.
- If #0 is ever set to NULL, the current subtree disappears
- (without freeing the nodes). Normally, #0 is only set when the
- automatic AST tree creation is turned off with a ! after the
- rule name in a definition.
-
- Since subrules do not have their own #0 values. The #i value of
- a subrule is undefined. However, for compatibility with $i
- numbering, subrules are counted as though they were assigned #i
- values:
-
- An example #-variable number is as follows.
-
-
-
-
- a : B
- ( C D /* #1 is C, #2 is D */ )*
- E
- /* #1 is B, #2 does not exist, #3 is E */
- ;
-
-
-
- When the automatic tree construction is turned off for a rule,
- #-variables for terminals are undefined since nodes are not
- created for them. #-variables always exist for rule references
- since the referenced rule may construct a tree (implicitly or
- explicitly).
-
-
-
- Page 71
-
- PCCTS 1.00
-
-
- #[args]
- This form constructs an AST node and returns a pointer to it (of
- type AST *). The arguments and the newly-constructed node are
- passed to the zzmk_ast() function defined by the user.
-
- #(root, child , child , ..., child )
- This for_ repres_nts a tree d_scriptor from a set of AST node
- pointers. It is translated to a function call that links all the
- child nodes together as siblings, with root as their parent. A
- pointer to the root is returned. If the root is NULL, a pointer
- to the sibling list is returned. #( NULL ) and #() both return
- NULL. #( nodeptr ) yields nodeptr. All child'ren become
- siblings of each other and any siblings of the child'ren that
- existed previously become siblings also. For example,
-
-
-
-
- #( NULL, #( NULL, #[a], #[b]), #[c] )
-
-
-
- results in a, b and c all being siblings.
-
- 4.2.7. Invoking parsers that construct trees
-
- PCCTS parsers pass an address of a root pointer to other rules
- while constructing AST's. This root pointer is always the first
- parameter and will point to the subtree built for that rule upon rule
- completion. Any parameters defined for a rule are not affected by
- this additional rule argument. All root pointers must be set to NULL
- initially. Normally, PCCTS generates code to handle all pointer
- parameter passing and tree manipulations. However, the starting rule
- (i.e. the one called using the ANTLR macro) may try to reference the
- root pointer it presumes was passed in. Therefore, the address of a
- NULL-initialized root pointer must be manually passed in. e.g.
-
-
-
-
- AST *root = NULL;
-
- main()
- {
- ANTLR(start(&root), stdin);
- }
-
- start : stuff ;
-
-
-
- This root parameter will be not be defined in rule start if the -gt
- command-line option is not used.
-
-
-
-
- Page 72
-
- PCCTS 1.00
-
-
- Note that if main() is located in a file other than a grammar
- file, the user must explicitly include ast.h.
-
- 4.2.8. Error recovery and AST's
-
- When a syntax error is discovered by an ANTLR-generated parser,
- no further additions will be made to the subtree of the current rule.
- Additions will resume after "successful" error recovery. The AST is
- always left in a stable state (i.e. a valid child-sibling tree) so
- that user actions may examine and/or modify the tree. This also
- ensures that syntax errors will not result in ill-formed trees.
-
- 5. Error detection and repair
-
- Reasonable error recovery is a notoriously difficult task which
- is handled very poorly, or not at all, by most parser-generator sys-
- tems. In part, this may be due to the fact that most systems con-
- struct LR parsers, and these are theoretically less able to recover
- from errors than LL parsers [FiL88]. It can also be justified on the
- basis that a parser can be used to quickly find the first error, so
- that an interactive user can repair the error and recompile without
- having to read through a multitude of additional error messages which
- might represent the parser's confusion rather than actual errors.
-
- In contrast, PCCTS attempts to provide a simple, automatic, error
- reporting and recovery mechanism. Currently, ANTLR-generated parsers
- recover from syntax errors by consuming tokens until a token in the
- FOLLOW set of X is discovered where X is the rule in which the error
- was discovered. Errors are reported in terms of the set of tokens
- which could legally have appeared at that point.
-
- A syntax error occurs when a PCCTS parser discovers a token on
- the input stream that cannot be matched to any currently recognizable
- rule element. Because PCCTS performs grammar analysis on the user's
- grammar description, it is aware of all possible tokens that can be
- recognized at a given instant. Sets representing these tokens are
- passed to an error reporting function when a syntax error is
- discovered. Optionally, a fail-action can be executed.
-
- [Wir76] provides a simple mechanism for recovering from syntax
- errors in LL(1) parsers by consuming tokens until the current token is
- found to be a member of a "stopping set." This stopping set is
- comprised of the FOLLOW set for that rule and any user-defined marking
- ("resynch") tokens that are never to be skipped (e.g. BEGIN). PCCTS
- currently employs a simple variant of the [Wir76] mechanism. When a
- syntax error occurs in any production of rule X, the parser will con-
- sume input until the current token is a member of the stopping set for
- X.
-
- The syntax error recovery mechanism employed by PCCTS does not
- effect parser recognition speed unless an error is discovered. PCCTS
- error recovery is attractive because it is completely automatic, but
- it tends to delete groups of tokens which may lead to erroneous resyn-
- chronizations. This "glutton" scheme could be refined by allowing the
-
-
-
- Page 73
-
- PCCTS 1.00
-
-
- user to specify a set of resynch tokens to augment the FOLLOW set
- [Wir76]. These resynch tokens would be determined by experience.
-
- Sophisticated error recovery is possible using k tokens of look-
- ahead. Future versions of PCCTS might employ this large look-ahead
- for intelligent error recovery. Hence, it can be expected that
- automatic error handling will improve without requiring users to
- change the input to PCCTS.
-
- 5.1. Modifying grammars to recognize common syntax errors
-
- The user may modify a given grammar to recognize common, but
- erroneous sentences. For example, if rule expr8 below were part of an
- expression recognizer for a programming language,
-
-
-
-
- expr8 : VAR
- | INT
- | FUNC "\(" elist "\)"
- ;
-
-
-
- it could be modified to print an error message whenever an undefined
- variable was seen. In other words,
-
-
-
-
- expr8 : VAR
- | INT
- | FUNC "\(" elist "\)"
- | WORD <<error("Undefined variable: %s", $1);>>
- ;
-
-
-
- Without this additional production, the parser would print something
- like (assuming test was the undefined variable found):
-
-
-
-
- syntax error at "test" missing { VAR INT FUNC }
-
-
-
- The extra production would allow a more informative error message to
- be printed, such as:
-
-
-
-
-
-
- Page 74
-
- PCCTS 1.00
-
-
-
-
-
-
- Undefined variable: "test"
-
-
-
- 5.2. Default error reporting function
-
- The default error reporting function, zzsyn(), is defined as fol-
- lows.
-
-
-
-
- void
- zzsyn(char *text, /* text of current token */
- int tok, /* current token */
- char *egroup, /* label attached to rule with error if present */
- unsigned *eset, /* etok==0 -> >1 token missing; set is eset */
- int etok) /* etok>0 -> 1 token missing; token is etok */
- {
- fprintf(stderr, "syntax error at \"%s\"", text);
- fprintf(stderr, " missing ");
- if ( !etok ) zzedecode(eset);
- else fprintf(stderr, "%s", zztokens[etok]);
- if ( strlen(egroup) > 0 ) fprintf(stderr, " in %s", egroup);
- fprintf(stderr, "\n");
- }
-
-
-
- A list of the regular expression or label associated with each
- token value is stored in the char *zztokens[] array in err.c.
- zzedecode() is a predefined function that decodes the set of tokens
- stored in the bit-set eset.
-
- zzsyn() can be modified according to the user's needs. For exam-
- ple, it can be modified to print out the file and line number
- (zzline).
-
- 5.3. Error groups
-
- The user may attach a string to a non-terminal that is passed on
- to the error reporting macro/function zzsyn() by placing it just
- before the : in a rule definition. e.g.
-
-
-
-
- declarator "declaration or definition" : ... ;
-
-
-
-
-
- Page 75
-
- PCCTS 1.00
-
-
- Although zzsyn() can be redefined by the user, the default error
- reporting facility appends the specified string to the error message.
- More than one rule may share the same string. This mechanism can gen-
- erate decent error messages. For example, one could specify that all
- rules dealing with expressions (e.g. factor, term, conditional) are to
- be labeled as "expression". The default zzsyn() would then yield
- something like:
-
-
-
-
- syntax error at "NUM" missing { "\+" "\*" } in expression
-
-
-
- as opposed to
-
-
-
-
- syntax error at "NUM" missing { "\+" "\*" }
-
-
-
- Another example might be that all rules used to define variables, such
- as rule declarator above, should be labeled
- "declaration or definition".
-
- 5.4. Error token classes
-
- The default syntax error reporting facility generates a list of
- tokens that could be possibility matched when the erroneous token was
- encountered. Often, this list is large and means little to the user
- for anything but small grammars. For example, an expression recog-
- nizer might generate the following error message for an invalid
- expression, "a . b":
-
-
-
-
- syntax error at "." missing { "\+" "\-" "\*" "\/" "\&" "\|" "\<" "\>" "\=" ";" }
-
-
-
- A better error message would be
-
-
-
-
- syntax error at "." missing { operator ";" }
-
-
-
- This modification can be accomplished by defining the error class:
-
-
-
- Page 76
-
- PCCTS 1.00
-
-
-
-
-
-
- #errclass "operator" { "\+" "\-" "\*" "\/" "\&" "\|" "\<" "\>" "\=" }
-
-
-
- 5.4.1. Error class definitions
-
- The general syntax for #errclass is as follows:
-
-
-
-
- #errclass label { e e ... e }
- 1 2 n
-
-
- where label is either a quoted string or a label (capitalized just
- like token labels). The quoted string must not conflict with any rule
- name or regular expression. Groups of expressions will be replaced
- with this string and it must not appear to be a simple token. Simi-
- larly, an error class label may not share the same name as a labeled
- token. The error class elements, e , can be a labeled token, a regu-
- lar expression or a non-terminal. T_kens (labeled or regular expres-
- sions) referenced within an error class must at some point in the
- grammar be referenced in a rule or explicitly defined with #token.
- The definition need not appear before the #errclass definition. If a
- non-terminal is referenced, the FIRST set (set of all tokens that can
- be recognized first upon entering a rule) for that rule is instan-
- tiated into the error class. The -ge command-line option can be used
- to have PCCTS generate an error class for each non-terminal of the
- form:
-
-
-
-
- #errclass Rule { rule }
-
-
-
- which implies that each non-terminal will have an error class composed
- of the first set for that rule. The error class name is the same as
- the non-terminal except that the first character is converted to upper
- case.
-
- Error class names may also be used as error class elements (e )
- since error class names are treated somewhat like tokens. This capa-
- bility allows error class hierarchies. For example,
-
-
-
-
-
-
-
- Page 77
-
- PCCTS 1.00
-
-
-
-
-
-
- #errclass Fruit { CHERRY APPLE }
- #errclass Meat { COW PIG }
- #errclass "stuff you can eat" { Fruit Meat }
-
- yum : (CHERRY | APPLE) PIE
- | (COW | PIG) FARM
- | THE (CHERRY | APPLE) TREE
- ;
-
-
-
- Different error messages will result depending upon where in rule a a
- syntax error is detected. If the input were
-
-
-
-
- THE PIG TREE
-
-
-
- the following error message would result:
-
-
-
-
- syntax error at "PIG" missing { Fruit }
-
-
-
- However, if the input were
-
-
-
-
- FARM COW
-
-
-
- the decent error message
-
-
-
-
- syntax error at "FARM" missing { "stuff you can eat" THE }
-
-
-
- would result. Note that without the error class definitions, the
- error message would have been:
-
-
-
- Page 78
-
- PCCTS 1.00
-
-
-
-
-
-
- syntax error at "FARM" missing { CHERRY APPLE COW PIG THE }
-
-
-
- which conveys the same information, but at a much lower level.
-
- 5.4.2. Error class utilization
-
- PCCTS attempts to construct sets of tokens for error reporting--
- error sets. The sets are created wherever a parsing decision will be
- made in the generated parser. At every point in the parsing process
- there exists a set of currently recognizable/acceptable terminals.
- This set can be decoded and printed out when a syntax error is
- detected - when the current token is not currently recognizable.
- PCCTS attempts to replace subsets of all error sets with error classes
- defined by the user. For example, rule a below contains a subrule
- with more than one alternative which implies that a parsing decision
- will be required at run-time to determine which alternative to choose.
-
-
-
-
- a : (Happy | Sad | Funny | Carefree) Person ;
-
-
-
- If, upon entering rule a, the current token is not one of the four
- terminals found in the alternatives, a syntax error will have occurred
- and the following message would be generated (if "huh" were the
- current token):
-
-
-
-
- syntax error at "huh" missing { Happy Sad Funny Carefree }
-
-
-
- Let us define an error class called Adjective that groups those same
- four tokens together.
-
-
-
-
- #errclass Adjective { Happy Sad Funny Carefree }
-
-
-
- Now, the error message would be:
-
-
-
-
- Page 79
-
- PCCTS 1.00
-
-
-
-
-
-
- syntax error at "huh" missing { Adjective }
-
-
-
- PCCTS repeatedly trys to replace subsets of the error set until
- no more substitutions can be made. At each replacement iteration, the
- largest error class that is completely contained within the error set
- is substituted for that group of tokens. One replacement iteration
- may perform some substitution that makes another, previously inviable,
- substitution possible. This allows the hierarchy mechanism described
- above in the error class description section. The sequence of substi-
- tutions for the yum example would be:
-
-
-
-
- [i] { CHERRY APPLE COW PIG THE }
- [ii] { Fruit COW PIG THE }
- [iii] { Fruit Meat THE }
- [iv] { "stuff you can eat" THE }
-
-
-
- The error class mechanism leads to smaller error sets and can be
- used to provide more informative error messages.
-
- 6. Things you should know about
-
- This section is a collection of important things that did not
- properly belong in any other section.
-
- 6.1. Available program symbols
-
- This section describes all symbols visible to the user that may
- be of interest.
-
- 6.1.1. Macros, functions, #define's
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Page 80
-
- PCCTS 1.00
-
-
-
-
-
-
- ZZCOL /* define so that DLG tracks column # of tokens */
- ZZLEXBUFSIZE /* if not set, default size of lexical bufs */
- ZZA_STACKSIZE /* if not set, default size of attrib stack */
- ZZAST_STACKSIZE /* if not set, default size of AST stack */
- zzTRACE(rule) /* macro called at start of rule when -gd set */
- LL_K /* == max # tokens of look-ahead used in parser */
- ANTLR(start,stream) /* ANTLR startup macro; get input from stream */
- ANTLRf(start,funcPtr) /* ANTLR startup macro; get input from func */
- zzcr_attr() /* create attribute from token, text on input */
- zzd_attr() /* call zzd_attr(&($i)) for each attrib created */
- zzcr_ast() /* how to create an AST node from attribute */
- zzmk_ast() /* how to create an AST node (user-defined) */
- zzd_ast() /* how to destroy an AST node */
- LA(i) /* ith token of look-ahead */
- LATEXT(i) /* text for ith token of look-ahead */
- zzlextext /* buffer holding text of current token */
- zzsyn() /* error reporting macro */
- zzpre_ast(tree, func1, func2, func3) /* see section 4.2.5 */
- zzfree_ast(tree) /* see section 4.2.5 */
- zzdup_ast(tree) /* see section 4.2.5 */
- zzedecode(unsigned *error_set); /* decode a bit-set */
- zzskip() /* Tell DLG to skip this token */
- zzmore() /* Tell DLG to try for more chars */
- zzmode(mode) /* switch DLG to another lex mode */
- zzadvance() /* Get next char allow only valid chars */
- zzrdstream(s) /* Reset lex_line, use stream s for input */
- zzrdfunc(f) /* Reset lex_line, call function f to read input */
- zzreplchar(c) /* replace current lex buffer with c */
- zzreplstr(s) /* replace current lex buffer with s */
-
-
-
- 6.1.2. Global variables
-
-
-
-
- char *zztokens[], /* zztokens[t] is rexpr or label for token t */
- *zzbegexpr, /* start of text for currently recognized expr */
- *zzendexpr; /* end of text for currently recognized expr */
- int zzline, /* set by user to current line # */
- zzbegcol, /* column # of 1st char in token */
- zzendcol, /* column # of last char in token */
- zzchar; /* current character of look-ahead */
- void (*zzerr)(); /* ptr to func to exec upon lexical error */
-
-
-
- One of the most common actions found among PCCTS descriptions is a
- statement similar to
-
-
-
- Page 81
-
- PCCTS 1.00
-
-
-
-
-
-
- #token "\n" << zzline++; >> /* Move to next line */
-
-
-
- 6.2. ANSI versus K&R
-
- K&R, actually Kernighan and Ritchie, "The C Programming
- Language," Prentice-Hall, 1978, was effectively the standard for the C
- language until the ANSI X3-J11 committee put forth a definition of
- ANSI C [ANS90]. Unfortunately, ANSI C is not compatible with K&R C,
- yet both dialects are in common use. For this reason, PCCTS supports
- generation of either ANSI C or K&R C.
-
- By default, PCCTS generates K&R C code with parameterless extern
- declarations automatically placed in tokens.h for each parsing func-
- tion that has a return value. If the -ga command-line option is used
- (on both ANTLR and DLG), PCCTS generates ANSI C compatible code and
- function prototype declarations; all parsing functions with return
- values or parameter definitions have appropriate prototypes placed in
- tokens.h.
-
- According to the ANSI C standard, a compliant compiler must
- define the macro name __STDC__, which would not be defined by a K&R C
- compiler. Hence, code can be written to work correctly with both ANSI
- C and K&R C by using #ifdef to test if __STDC__ has been defined.
- Although the output of PCCTS would have been excessively large if
- #ifdef had been used to make the code be acceptable to both ANSI C and
- K&R C compilers, #include files and other support code may test for
- the macro __STDC__.
-
- 6.3. Bugs
-
- Occasionally, PCCTS will find a grammar that it cannot analyze
- and you will receive an error similar to the following.
-
-
-
-
- file.c, line 43: out of memory while analyzing alts 2 and 5 of (..)+
-
-
-
- This is due to LL(k) grammar analysis requiring exponential space in
- some cases. Normally, this problem only will occur with k > 2 for
- large, ambiguous, grammars. A technique which avoids the worst-case
- exponential space requirement is being considered for a future
- release.
-
- Although not technically a "bug," the differences between B
- release PCCTS and version 1.00 will cause numerous errors to be
-
-
-
- Page 82
-
- PCCTS 1.00
-
-
- reported if old grammars are submitted to the new system. Most of
- these incompatibilities require only minor changes; see section 7.1.
- This was expected - why do you think we called it B?
-
- The code generated using the -ga option has not been tested for
- compliance with the ANSI C standard.
-
- The source code for ANTLR itself is not ANSI compliant, although
- it is fairly close.
-
- Error messages reporting ambiguities involving nullable subrules
- (i.e., {} or ()*) sometime blame the ambiguity on an alternative
- rather than the nullable construct.
-
- The names of PCCTS variables and/or functions are not necessarily
- consistent, mnemonic or convenient. PCCTS symbols are, therefore,
- subject to change.
-
- ANTLR will not compile on any system that does not have _iobuf as
- the standard FILE * structure name. This is because struct _iobuf *
- is used in a header file instead of FILE *. The header file is
- automatically generated via proto--the C example distributed with
- PCCTS. You can circumvent this problem by changing the declaration of
- struct _iobuf * in proto.h to FILE *.
-
- ANTLR normally checks to see that you do not reference an attri-
- bute (parameter, return value, normal attribute) that is not defined.
- However, if you use a $name where name is a substring of a valid
- parameter or return value, it thinks that it is ok.
-
- Invoking an ANTLR parser multiple times or invoking multiple
- parsers does not work very well; tokens tend to be lost.
-
- 7. History
-
- PCCTS Version 1.00 was written using B release PCCTS - 1.0B.
- 1.0B was, in turn, written in alpha release PCCTS, 1.0A, which was
- written in YUCC (a graduate class project gone wild). YUCC was writ-
- ten in PG; both of which were simple BNF-style recursive descent
- parser generators. PG was written in RD [JuD84].
-
- 7.1. Differences from B release PCCTS
-
- PCCTS Version 1.00 is significantly different from B release
- PCCTS. However, Version 1.00 is mostly a superset - most old grammars
- require only minor changes for version 1.00. This section delineates
- all of the changes and a few possible migration paths.
-
- o #attrib is now called #header to reflect the more general use of
- the construct in version 1.00.
-
- o The ANTLR initiation macro now requires that the starting rule
- have () appended just as you would call the function by hand.
- For example, ANTLR(start(), stdin).
-
-
-
- Page 83
-
- PCCTS 1.00
-
-
- o A number of internal functions and variables have been renamed so
- that conflicts with user-defined names are less likely. All the
- new names begin with the prefix zz. For example:
-
- o aCreate() is now zzcr_attr().
-
- o cur_char is now called zzchar, but the user should not need
- to reference it.
-
- o LexSkip and LexMore are now zzskip and zzmore.
-
- o Token is now called LA(1) (first token of look-ahead). LA(i) is
- the ith token.
-
- o LexText is now LATEXT(1) (text for first token of look-ahead).
- LATEXT(i) is the text for the ith token.
-
- o ANTLRi no longer exists.
-
- o Rules no longer automatically define an attribute parameter.
- Also, parameters of the form {...} (i.e. rule[{...}]) cannot be
- used. Basically, parameter passing to a rule is exactly like in
- a programming language. All parameters must be defined and can
- be any valid type. Also, return values may be defined for rules
- rather than using old-style upward attributes. Rules may simu-
- late the old-style $0 inheritance via
-
-
-
-
- rule[Attrib inh] : <<$0 = inh;>> ... ;
-
-
-
- o A new file called mode.h is created by DLG and is included into
- your ANTLR parser. This implies that DLG must be run before you
- can compile the C files created by ANTLR.
-
- o New command-line options have appeared, old ones have been
- changed and/or disappeared. To be precise,
-
-
- -T is now -gd.
-
- -z is gone.
-
- -n is now -gx.
-
- -l is now -fl.
-
- -c is now -gc.
-
- -I is gone.
-
-
-
-
- Page 84
-
- PCCTS 1.00
-
-
- -s is gone.
-
- -fi is gone.
-
- -fo is gone.
-
- -R is now -cr.
-
- -e is gone.
-
- -a is gone.
-
- -w is now one of -e1, -e2, -e3.
-
- -d is gone.
-
- -E is gone since the user can redefine zzsyn().
-
- o zzreplchar() and zzreplstr() should be used instead of modifying
- the lexical buffer manually.
-
- o A pointer to a function called zzerr is set to a function to call
- upon lexical error (invalid token etc...).
-
- 7.2. Future directions
-
- Neither we nor Purdue University guarantee that PCCTS works or
- accept liability if it does not do what you want it to. However, we
- do try to maintain and improve the system. So, we value user feedback
- that might help us improve PCCTS.
-
- Within Purdue University's School of Electrical Engineering, we
- are constantly updating PCCTS. The 1.00 release does not include all
- the latest goodies - it contains only those features whose implementa-
- tion we believe to be useful, correct, and reasonably stable. The
- following is a partial list of improvements currently being con-
- sidered:
-
- Parser construction
- Currently, ANTLR parsers use a sequence of if statements to find
- the correct alternative to match. This is inefficient if only
- one token of look-ahead is required to uniquely determine which
- alternative to choose. Depending on the C compiler, a switch
- statement would generate much faster code.
-
- Attribute and AST variables are maintained on software stacks in
- 1.00. Grammar analysis could easily create a collection of local
- hardware-stack-based variables to use in place of the software
- stack. This would eliminate the need to specify an attribute or
- AST stack size and produce faster code.
-
- Inline rules
- For the most part, subrules are like macro-expansions of rules.
- Rather than referencing a rule, the user simply instantiates the
-
-
-
- Page 85
-
- PCCTS 1.00
-
-
- rule in place of the rule reference. Grammatically the two
- methods are the same, but they differ in two important ways.
- First, subrules cannot employ the sophisticated parameter passing
- mechanism available to rules. Second, subrules are much faster
- than rule references since subrules avoid the function call over-
- head. A future version of PCCTS might introduce inline rules.
- Inline rules would be defined as before (with the addition of the
- inline keyword) but would be instantiated into rules that refer-
- ence them. This would cut down on grammar clutter by breaking up
- long rules while maintaining the speed advantage of subrules.
-
- Abstract-syntax-trees
- Through experience with PCCTS AST's, a new group of tree manipu-
- lation routines should appear. Some of them could relate to
- code-generation.
-
- Example grammars
- More and more grammars are being built with PCCTS. Some of them
- might be made available through future PCCTS releases and through
- users' network postings.
-
- Recognition
- Currently, PCCTS parsers cannot recognize as large a class of
- languages as bottom-up parsers. Research is underway to create
- hybrid parsers that would retain top-down's programmer-interface
- features while gaining bottom-up's recognition strength.
-
- Code generation
- Research continues at Purdue toward a code-generator generator
- for parallel machines (PIG).
-
- A simple uni-processor code-generator that has been in use at
- Purdue may be made available in the next release of PCCTS.
-
- $tokenPCCTS only recognizes $i, $i.j, $rule variables presently.
- Future versions might allow $token variables which will refer to
- attributes for tokens by the name of the token not the position
- number.
-
- 7.3. Acknowledgements
-
- Thanks is due to the users of B release PCCTS for their excellent
- suggestions. In particular, Mark Nichols (PhD candidate in Electrical
- Engineering at Purdue) discovered many of the initial quirks and bugs.
- We acknowledge Dana Hoggatt (Micro Data Base Systems Inc) for his idea
- of error grouping (strings attached to non-terminals) and his signifi-
- cant software testing efforts. Thanks is due Professor Matthew
- O'Keefe and Peter Dahl (both at University of Minnesota) for their
- review of the user's manual and their bug-finding escapades into
- PCCTS.
-
-
-
-
-
-
-
- Page 86
-
- PCCTS 1.00
-
-
- 8. References
-
- [ANS90]
- American National Standard for Information Systems, "Programming
- Language C," Document X3J11/90-013, February 1990.
-
- [AhU79]
- A.V. Aho, J.D. Ullman, Principles of Compiler Design, Addison-
- Wesley Publishing Company, Reading, Massachusetts, 1979.
-
- [DoP90]
- H. Dobler and K. Pirklbauer, "Coco-2 A New Compiler Compiler,"
- ACM SIGPLAN Notices, Vol. 25, No. 5, May 1990.
-
- [Joh78]
- Stephen C. Johnson, "Yacc: Yet Another Compiler-Compiler," Bell
- Laboratories, Murray Hill, NJ, 1978.
-
- [JuD84]
- R. Juels, H. Dietz, S. Arbeeny, J. Patane, E. Tepe, "Automatic
- Translation Systems," Study Report, Center For Digital Systems,
- Department of Electrical Engineering and Computer Science, The
- Polytechnic Institute of New York, New York, 1984.
-
- [KeR88]
- Brian W. Kernighan and Dennis M. Ritchie, "The C programming
- Language," Prentice Hall Inc., Englewood Cliffs, New Jersey,
- 1988.
-
- [Les75]
- M. E. Lesk, "LEX--a Lexical Analyzer Generator," CSTR 39 Bell
- Laboratories, Murray Hill, NJ, 1975.
-
- [LeP81]
- Harry R. Lewis, Christos H. Papadimitriou, Elements of the Theory
- of Computation, Prentice Hall, Inc., Englewood Cliffs, New Jer-
- sey, 1981.
-
- [Nau63]
- P. Naur (ed.), "Revised report on the algorithmic language ALGOL
- 60," Communications of the ACM 6:1, 1-17.
-
- [PaD90]
- Terence Parr, Henry Dietz, Will Cohen, "Purdue Compiler-
- Construction Tool Set," Technical Report TR-EE 90-14, School Of
- Electrical Engineering, Purdue University, West Lafayette, IN,
- February 1990.
-
- [Par90]
- Terence Parr, "The Analysis of Extended BNF Grammars and the Con-
- struction of LL(1) Parsers," Technical Report TR-EE 90-30, School
- Of Electrical Engineering, Purdue University, West Lafayette, IN,
- May 1990.
-
-
-
-
- Page 87
-
- PCCTS 1.00
-
-
- [PuC89]
- James J. Purtilo and John R. Callahan, "Parse-Tree Annotations",
- Communications of the ACM, Vol. 32, No. 12, December 1989.
-
- [Str87]
- Bjarne Stroustrup, "The C++ Programming Language," Addison-Wesley
- Publishing Company, Reading, Massachusetts, 1987.
-
- [Wir76]
- Niklaus Wirth, Algorithms + Data Structures = Programs, Prentice
- Hall, Inc., Englewood Cliffs, New Jersey, 1976.
-
- Notes:[LeP81] credits P. Naur [Nau63] with Backus-Naur Form (BNF) and
- is a reasonable text on language theory. [AhU79] gave more com-
- plete references to Johnson's YACC and Lesk's LEX - i.e. the CSTR
- numbers from Bell Laboratories.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- PCCTS 1.00
-
-
-
-
-
- Table of Contents
-
-
-
-
- 1. Introduction ................................................. 1
-
- 1.1. PCCTS programming interface ....................... 2
-
- 1.2. PCCTS information flow ............................ 3
-
- 1.3. PCCTS file(s) format .............................. 3
-
- 1.4. Makefile template ................................. 5
-
- 1.5. PCCTS component command-line options .............. 6
-
- 1.6. Introductory example .............................. 10
-
- 2. Lexical analysis and token definition ........................ 13
-
- 2.1. Token Labeling and token actions .................. 13
-
- 2.2. Lexical actions via #lexaction .................... 16
-
- 2.3. Multiple lexical classes .......................... 16
-
- 2.3.1. Multiple grammars, multiple lexical
- analyzers ....................................................... 17
-
- 2.3.2. Single grammar, multiple lexical
- analyzers ....................................................... 18
-
- 2.4. Handling end of input ............................. 18
-
- 2.5. Token order and lexical ambiguities ............... 19
-
- 2.6. Quoted tokens ..................................... 19
-
- 2.7. Interactive lexical analyzers ..................... 21
-
- 2.8. DLG lexical input ................................. 21
-
- 2.9. Tracking pattern position ......................... 22
-
- 3. Syntactic analysis ........................................... 22
-
- 3.1. PCCTS rule definitions ............................ 23
-
- 3.1.1. Subrules ............................... 24
-
-
-
-
-
-
- PCCTS 1.00
-
-
- 3.1.2. Rule communication ..................... 25
-
- 3.1.3. Miscellaneous notes .................... 27
-
- 3.1.4. LL(k) parsing .......................... 28
-
- 3.2. Grammar ambiguities ............................... 28
-
- 3.3. Look-ahead size ................................... 30
-
- 3.4. ANTLR parser construction ......................... 30
-
- 3.4.1. Efficiency ............................. 31
-
- 3.4.2. Debugging parsers ...................... 32
-
- 3.5. PCCTS parser template ............................. 32
-
- 3.6. Grammatical actions ............................... 34
-
- 3.6.1. Init Actions ........................... 34
-
- 3.6.2. Fail actions ........................... 35
-
- 3.6.3. Actions appearing outside of rules ..... 36
-
- 4. Attribute handling ........................................... 37
-
- 4.1. General attribute handling ........................ 37
-
- 4.1.1. Attribute creation ..................... 38
-
- 4.1.2. Attribute destruction .................. 40
-
- 4.1.3. Standard attribute definitions ......... 41
-
- 4.1.4. Attribute references ................... 42
-
- 4.1.4.1. $i references ............... 43
-
- 4.1.4.2. $i.j references ............. 44
-
- 4.1.4.3. $label references ........... 45
-
- 4.1.5. $[token, text] attribute constructor
-
- 4.1.6. Inheritance ............................ 47
-
- 4.1.6.1. $0 inheritance .............. 47
-
- 4.1.6.2. Generalized rule communi-
- cation .......................................................... 48
-
- 4.1.6.2.1. Downward-
-
-
-
-
-
- PCCTS 1.00
-
-
- inheritance ..................................................... 48
-
- 4.1.6.2.2. Upward-
- inheritance ..................................................... 50
-
- 4.2. Abstract-syntax-tree construction and manipula-
- tion ............................................................ 50
-
- 4.2.1. AST node creation ...................... 51
-
- 4.2.2. AST node destruction ................... 54
-
- 4.2.3. Automatic AST construction ............. 55
-
- 4.2.3.1. Grammar annotation .......... 55
-
- 4.2.3.2. Operator precedence ......... 57
-
- 4.2.3.3. Example ..................... 58
-
- 4.2.4. Explicit AST construction .............. 61
-
- 4.2.4.1. Simple example .............. 62
-
- 4.2.4.2. C declaration to English
- translator ...................................................... 64
-
- 4.2.5. Tree manipulation routines ............. 69
-
- 4.2.6. AST references ......................... 71
-
- 4.2.7. Invoking parsers that construct trees
-
- 4.2.8. Error recovery and AST's ............... 73
-
- 5. Error detection and repair ................................... 73
-
- 5.1. Modifying grammars to recognize common syntax
- errors .......................................................... 74
-
- 5.2. Default error reporting function .................. 75
-
- 5.3. Error groups ...................................... 75
-
- 5.4. Error token classes ............................... 76
-
- 5.4.1. Error class definitions ................ 77
-
- 5.4.2. Error class utilization ................ 79
-
- 6. Things you should know about ................................. 80
-
- 6.1. Available program symbols ......................... 80
-
-
-
-
-
-
- PCCTS 1.00
-
-
- 6.1.1. Macros, functions, #define's ........... 80
-
- 6.1.2. Global variables ....................... 81
-
- 6.2. ANSI versus K&R ................................... 82
-
- 6.3. Bugs .............................................. 82
-
- 7. History ...................................................... 83
-
- 7.1. Differences from B release PCCTS .................. 83
-
- 7.2. Future directions ................................. 85
-
- 7.3. Acknowledgements .................................. 86
-
- 8. References ................................................... 87
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-